linux-kernel - Re: [PATCH 4/4] mm: Make most walk page paths with pmd_trans

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZH41YzZ0DBoF8csH@x1n>
Date:   Mon, 5 Jun 2023 15:20:03 -0400
From:   Peter Xu <peterx@...hat.com>
To:     Yang Shi <shy828301@...il.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        David Hildenbrand <david@...hat.com>,
        Alistair Popple <apopple@...dia.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Johannes Weiner <hannes@...xchg.org>,
        John Hubbard <jhubbard@...dia.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Muhammad Usama Anjum <usama.anjum@...labora.com>,
        Hugh Dickins <hughd@...gle.com>,
        Mike Rapoport <rppt@...nel.org>
Subject: Re: [PATCH 4/4] mm: Make most walk page paths with
 pmd_trans_unstable() to retry

On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <peterx@...hat.com> wrote:
> >
> > For most of the page walk paths, logically it'll always be good to have the
> > pmd retries if hit pmd_trans_unstable() race.  We can treat it as none
> > pmd (per comment above pmd_trans_unstable()), but in most cases we're not
> > even treating that as a none pmd.  If to fix it anyway, a retry will be the
> > most accurate.
> >
> > I've went over all the pmd_trans_unstable() special cases and this patch
> > should cover all the rest places where we should retry properly with
> > unstable pmd.  With the newly introduced ACTION_AGAIN since 2020 we can
> > easily achieve that.
> >
> > These are the call sites that I think should be fixed with it:
> >
> > *** fs/proc/task_mmu.c:
> > smaps_pte_range[634]           if (pmd_trans_unstable(pmd))
> > clear_refs_pte_range[1194]     if (pmd_trans_unstable(pmd))
> > pagemap_pmd_range[1542]        if (pmd_trans_unstable(pmdp))
> > gather_pte_stats[1891]         if (pmd_trans_unstable(pmd))
> > *** mm/memcontrol.c:
> > mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
> > mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
> > *** mm/memory-failure.c:
> > hwpoison_pte_range[794]        if (pmd_trans_unstable(pmdp))
> > *** mm/mempolicy.c:
> > queue_folios_pte_range[517]    if (pmd_trans_unstable(pmd))
> > *** mm/madvise.c:
> > madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
> > madvise_free_pte_range[625]    if (pmd_trans_unstable(pmd))
> >
> > IIUC most of them may or may not be a big issue even without a retry,
> > either because they're already not strict (smaps, pte_stats, MADV_COLD,
> > .. it can mean e.g. the statistic may be inaccurate or one less 2M chunk to
> > cold worst case), but some of them could have functional error without the
> > retry afaiu (e.g. pagemap, where we can have the output buffer shifted over
> > the unstable pmd range.. so IIUC the pagemap result can be wrong).
> >
> > While these call sites all look fine, and don't need any change:
> >
> > *** include/linux/pgtable.h:
> > pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
> > *** mm/gup.c:
> > follow_pmd_mask[695]           if (pmd_trans_unstable(pmd))
> > *** mm/mapping_dirty_helpers.c:
> > wp_clean_pmd_entry[131]        if (!pmd_trans_unstable(&pmdval))
> > *** mm/memory.c:
> > do_anonymous_page[4060]        if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > *** mm/migrate_device.c:
> > migrate_vma_insert_page[616]   if (unlikely(pmd_trans_unstable(pmdp)))
> > *** mm/mincore.c:
> > mincore_pte_range[116]         if (pmd_trans_unstable(pmd)) {
> >
> > Signed-off-by: Peter Xu <peterx@...hat.com>
> > ---
> >  fs/proc/task_mmu.c  | 17 +++++++++++++----
> >  mm/madvise.c        |  8 ++++++--
> >  mm/memcontrol.c     |  8 ++++++--
> >  mm/memory-failure.c |  4 +++-
> >  mm/mempolicy.c      |  4 +++-
> >  5 files changed, 31 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 6259dd432eeb..823eaba5c6bf 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >                 goto out;
> >         }
> >
> > -       if (pmd_trans_unstable(pmd))
> > +       if (pmd_trans_unstable(pmd)) {
> > +               walk->action = ACTION_AGAIN;
> >                 goto out;
> > +       }
> > +
> >         /*
> >          * The mmap_lock held all the way back in m_start() is what
> >          * keeps khugepaged out of here and from collapsing things
> > @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> >                 return 0;
> >         }
> >
> > -       if (pmd_trans_unstable(pmd))
> > +       if (pmd_trans_unstable(pmd)) {
> > +               walk->action = ACTION_AGAIN;
> >                 return 0;
> > +       }
> >
> >         pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >         for (; addr != end; pte++, addr += PAGE_SIZE) {
> > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> >                 return err;
> >         }
> >
> > -       if (pmd_trans_unstable(pmdp))
> > +       if (pmd_trans_unstable(pmdp)) {
> > +               walk->action = ACTION_AGAIN;
> >                 return 0;
> 
> Had a quick look at the pagemap code, I agree with your analysis,
> "returning 0" may mess up pagemap, retry should be fine. But I'm
> wondering whether we should just fill in empty entries. Anyway I don't
> have a  strong opinion on this, just a little bit concerned by
> potential indefinite retry.

Yes, none pte is still an option.  But if we're going to fix this anyway,
it seems better to fix it with the accurate new thing that poped up, and
it's even less change (just apply walk->action rather than doing random
stuff in different call sites).

I see that you have worry on deadloop over this, so I hope to discuss
altogether here.

Unlike normal checks, pmd_trans_unstable() check means something must have
changed in the past very short period or it should just never if nothing
changed concurrently from under us, so it's not a "if (flag==true)" check
which is even more likely to loop.

If we see the places that I didn't touch, most of them suggested a retry in
one form or another.  So if there's a worry this will also not the first
time to do a retry (and for such a "unstable" API, that's really the most
natural thing to do which is to retry until it's stable).

So in general, it seems to me if we deadloop over pmd_trans_unstable() for
whatever reason then something more wrong could have happened..

Thanks,

-- 
Peter Xu