Message-ID: <CAGsJ_4z_ftxvG-EcTe=X+Te8fNSShhVHHPvbEgAa1rQXgO5XCA@mail.gmail.com>
Date: Tue, 28 Nov 2023 13:11:19 +1300
From: Barry Song <21cnbao@...il.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: akpm@...ux-foundation.org, andreyknvl@...il.com,
anshuman.khandual@....com, ardb@...nel.org,
catalin.marinas@....com, david@...hat.com, dvyukov@...gle.com,
glider@...gle.com, james.morse@....com, jhubbard@...dia.com,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, mark.rutland@....com, maz@...nel.org,
oliver.upton@...ux.dev, ryabinin.a.a@...il.com,
suzuki.poulose@....com, vincenzo.frascino@....com,
wangkefeng.wang@...wei.com, will@...nel.org, willy@...radead.org,
yuzenghui@...wei.com, yuzhao@...gle.com, ziy@...dia.com
Subject: Re: [PATCH v2 01/14] mm: Batch-copy PTE ranges during fork()
On Mon, Nov 27, 2023 at 10:24 PM Ryan Roberts <ryan.roberts@....com> wrote:
>
> On 27/11/2023 05:54, Barry Song wrote:
> >> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >> + pte_t *dst_pte, pte_t *src_pte,
> >> + unsigned long addr, unsigned long end,
> >> + int *rss, struct folio **prealloc)
> >> {
> >> struct mm_struct *src_mm = src_vma->vm_mm;
> >> unsigned long vm_flags = src_vma->vm_flags;
> >> pte_t pte = ptep_get(src_pte);
> >> struct page *page;
> >> struct folio *folio;
> >> + int nr = 1;
> >> + bool anon;
> >> + bool any_dirty = pte_dirty(pte);
> >> + int i;
> >>
> >> page = vm_normal_page(src_vma, addr, pte);
> >> - if (page)
> >> + if (page) {
> >> folio = page_folio(page);
> >> - if (page && folio_test_anon(folio)) {
> >> - /*
> >> - * If this page may have been pinned by the parent process,
> >> - * copy the page immediately for the child so that we'll always
> >> - * guarantee the pinned page won't be randomly replaced in the
> >> - * future.
> >> - */
> >> - folio_get(folio);
> >> - if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
> >> - /* Page may be pinned, we have to copy. */
> >> - folio_put(folio);
> >> - return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> >> - addr, rss, prealloc, page);
> >> + anon = folio_test_anon(folio);
> >> + nr = folio_nr_pages_cont_mapped(folio, page, src_pte, addr,
> >> + end, pte, &any_dirty);
> >
> > in case we have a large folio with 16 CONTPTE basepages, and userspace
> > does madvise(addr + 4KB * 5, DONTNEED);
>
> nit: if you are offsetting by 5 pages from addr, then below I think you mean
> page0~page4 and page6~15?
>
> >
> > thus, the 4th basepage of PTE becomes PTE_NONE and folio_nr_pages_cont_mapped()
> > will return 15. in this case, we should copy page0~page3 and page5~page15.
>
> No I don't think folio_nr_pages_cont_mapped() will return 15; that's certainly
> not how it's intended to work. The function is scanning forwards from the current
> pte until it finds the first pte that does not fit in the batch - either because
> it maps a PFN that is not contiguous, or because the permissions are different
> (although this is being relaxed a bit; see conversation with DavidH against this
> same patch).
>
> So the first time through this loop, folio_nr_pages_cont_mapped() will return 5,
> (page0~page4) then the next time through the loop we will go through the
> !present path and process the single swap marker. Then the 3rd time through the
> loop folio_nr_pages_cont_mapped() will return 10.
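For my own understanding, a minimal sketch of the scan you describe (my
reading of it, not the patch code; the real function also clamps to the
folio and compares permission bits, which I omit here) would be:

static inline int nr_cont_mapped_sketch(pte_t *ptep, unsigned long addr,
					unsigned long end, pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);
	int nr = 1;

	while (addr + nr * PAGE_SIZE < end) {
		pte_t next = ptep_get(ptep + nr);

		/*
		 * Stop at the first pte that breaks the batch: not
		 * present (e.g. a swap marker) or not mapping the next
		 * contiguous pfn.
		 */
		if (!pte_present(next) || pte_pfn(next) != pfn + nr)
			break;
		nr++;
	}

	return nr;
}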
One case we have hit while running tests on hundreds of real phones is as below:
static int
copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
	       unsigned long end)
{
...
	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
	if (!dst_pte) {
		ret = -ENOMEM;
		goto out;
	}
	src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
	if (!src_pte) {
		pte_unmap_unlock(dst_pte, dst_ptl);
		/* ret == 0 */
		goto out;
	}
	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
	orig_src_pte = src_pte;
	orig_dst_pte = dst_pte;
	arch_enter_lazy_mmu_mode();

	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
		ptent = ptep_get(src_pte);
		if (pte_none(ptent)) {
			progress++;
			continue;
		}
The above loop can break out whenever progress >= 32. For example, if the
PTEs at the beginning are all none, we break once progress >= 32, which can
leave us at the 8th PTE of a 16-PTE block that might become CONTPTE after
we release the PTL. Since we are releasing the PTLs, those pte_none()
entries might have become pte_cont() by the time we take the PTL again;
are you then going to copy CONTPTEs starting from the 8th PTE, immediately
breaking the hardware rule that CONTPTEs must be consistent?
pte0 - pte_none
pte1 - pte_none
...
pte7 - pte_none
pte8 - pte_cont
...
pte15 - pte_cont
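In other words, after re-taking the PTL the loop can resume in the middle
of a block that has since become CONTPTE. An illustrative check for that
situation (the helper name is mine, not from the patch; pte_cont() is the
arm64 contiguous-bit test):

static inline bool resumed_mid_contpte(pte_t pte, unsigned long addr)
{
	/*
	 * Illustrative only: true when the pte we resume at carries the
	 * contiguous bit but addr is not at a cont-pte block boundary,
	 * i.e. the run formed while the locks were dropped.
	 */
	return pte_present(pte) && pte_cont(pte) &&
	       !IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE);
}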
So we made a modification to avoid breaking in the middle of PTEs which can
potentially become CONTPTE:
	do {
		/*
		 * We are holding two locks at this point - either of them
		 * could generate latencies in another task on another CPU.
		 */
		if (progress >= 32) {
			progress = 0;
#ifdef CONFIG_CONT_PTE_HUGEPAGE
			/*
			 * XXX: don't release the ptl at an unaligned address,
			 * as cont_pte might form while the ptl is released;
			 * this causes a double-map
			 */
			if (!vma_is_chp_anonymous(src_vma) ||
			    (vma_is_chp_anonymous(src_vma) &&
			     IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE)))
#endif
			if (need_resched() ||
			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
				break;
		}
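The guard could equally be written as a small helper, roughly as below (a
sketch only, reusing the vendor symbols vma_is_chp_anonymous() and
HPAGE_CONT_PTE_SIZE from the snippet above):

static inline bool ptl_break_allowed(struct vm_area_struct *src_vma,
				     unsigned long addr)
{
#ifdef CONFIG_CONT_PTE_HUGEPAGE
	/* never drop the PTLs in the middle of a potential cont-pte block */
	if (vma_is_chp_anonymous(src_vma) &&
	    !IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE))
		return false;
#endif
	return true;
}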
We could only reproduce the above issue by running tests on thousands of
phones. Does your code survive this problem?
>
> Thanks,
> Ryan
>
> >
> > but the current code is copying page0~page14, right? unless we immediately
> > split_folio to basepages in zap_pte_range(), we will have problems?
> >
> >> +
> >> + for (i = 0; i < nr; i++, page++) {
> >> + if (anon) {
> >> + /*
> >> + * If this page may have been pinned by the
> >> + * parent process, copy the page immediately for
> >> + * the child so that we'll always guarantee the
> >> + * pinned page won't be randomly replaced in the
> >> + * future.
> >> + */
> >> + if (unlikely(page_try_dup_anon_rmap(
> >> + page, false, src_vma))) {
> >> + if (i != 0)
> >> + break;
> >> + /* Page may be pinned, we have to copy. */
> >> + return copy_present_page(
> >> + dst_vma, src_vma, dst_pte,
> >> + src_pte, addr, rss, prealloc,
> >> + page);
> >> + }
> >> + rss[MM_ANONPAGES]++;
> >> + VM_BUG_ON(PageAnonExclusive(page));
> >> + } else {
> >> + page_dup_file_rmap(page, false);
> >> + rss[mm_counter_file(page)]++;
> >> + }
> >
Thanks
Barry