[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zt-N8MB93XSqFZO_@x1n>
Date: Mon, 9 Sep 2024 20:08:16 -0400
From: Peter Xu <peterx@...hat.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Yan Zhao <yan.y.zhao@...el.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, Gavin Shan <gshan@...hat.com>,
Catalin Marinas <catalin.marinas@....com>, x86@...nel.org,
Ingo Molnar <mingo@...hat.com>, Paolo Bonzini <pbonzini@...hat.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Thomas Gleixner <tglx@...utronix.de>,
Alistair Popple <apopple@...dia.com>, kvm@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org,
Sean Christopherson <seanjc@...gle.com>,
Oscar Salvador <osalvador@...e.de>,
Jason Gunthorpe <jgg@...dia.com>, Borislav Petkov <bp@...en8.de>,
Zi Yan <ziy@...dia.com>, Axel Rasmussen <axelrasmussen@...gle.com>,
David Hildenbrand <david@...hat.com>, Will Deacon <will@...nel.org>,
Kefeng Wang <wangkefeng.wang@...wei.com>,
Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
On Mon, Sep 09, 2024 at 04:15:39PM -0700, Andrew Morton wrote:
> On Mon, 9 Sep 2024 18:43:22 -0400 Peter Xu <peterx@...hat.com> wrote:
>
> > > > > Do we need the logic to clear dirty bit in the child as that in
> > > > > __copy_present_ptes()? (and also for the pmd's case).
> > > > >
> > > > > e.g.
> > > > > if (vma->vm_flags & VM_SHARED)
> > > > > pud = pud_mkclean(pud);
> > > >
> > > > Yeah, good question. I remember I thought about that when initially
> > > > working on these lines, but I forgot the details, or maybe I simply tried
> > > > to stick with the current code base, as the dirty bit used to be kept even
> > > > in the child here.
> > > >
> > > > I'd expect there's only performance differences, but still sounds like I'd
> > > > better leave that to whoever knows the best on the implications, then draft
> > > > it as a separate patch but only when needed.
> > >
> > > Sorry, but this vaguensss simply leaves me with nowhere to go.
> > >
> > > I'll drop the series - let's revisit after -rc1 please.
> >
> > Andrew, would you please explain why it needs to be dropped?
> >
> > I meant in the reply that I think we should leave that as is, and I think
> > so far nobody in real life should care much on this bit, so I think it's
> > fine to leave the dirty bit as-is.
> >
> > I still think whoever has a better use of the dirty bit and would like to
> > change the behavior should find the use case and work on top, but only if
> > necessary.
>
> Well. "I'd expect there's only performance differences" means to me
> "there might be correctness issues, I don't know". Is it or is it not
> merely a performance thing?
There should have no correctness issue pending. It can only be about
performance, and AFAIU what this patch does is exactly the way where it
shouldn't ever change performance either, as it didn't change how dirty bit
was processed (just like before this patch), not to mention correctness (in
regards to dirty bits).
I can provide some more details.
Here the question we're discussing is "whether we should clear the dirty
bit in the child for a pgtable entry when it's VM_SHARED". Yan observed
that we don't do the same thing for pte/pmd/pud, which is true.
Before this patch:
- For pte: we clear dirty bit if VM_SHARED in child when copy
- For pmd/pud: we never clear dirty bit in the child when copy
The behavior of clearing dirty bit for VM_SHARED in child for pte level
originates to the 1st commit that git history starts. So we always do so
for 19 years.
That makes sense to me, because clearing dirty bit in pte normally requires
a SetDirty on the folio, e.g. in unmap path:
if (pte_dirty(pteval))
folio_mark_dirty(folio);
Hence cleared dirty bit in the child should avoid some extra overheads when
the pte maps a file cache, so clean pte can at least help us to avoid calls
into e.g. mapping's dirty_folio() functions (in which it should normally
check folio_test_set_dirty() again anyway, and parent pte still have the
dirty bit set so we won't miss setting folio dirty):
folio_mark_dirty():
if (folio_test_reclaim(folio))
folio_clear_reclaim(folio);
return mapping->a_ops->dirty_folio(mapping, folio);
However there's the other side of thing where when the dirty bit is missing
I _think_ it also means when the child writes to the cleaned pte, it'll
require (e.g. on hardware accelerated archs) MMU setting dirty bit which is
slower than if we don't clear the dirty bit... and on software emulated
dirty bits it could even require a page fault, IIUC.
In short, personally I don't know what's the best to do, on keep / remove
the dirty bit even if it's safe either way: there are pros and cons on
different decisions.
That's why I said I'm not sure which is the best way. I had a feeling that
most of the people didn't even notice this, and we kept running this code
for the past 19 years just all fine..
OTOH, we don't do the same for pmds/puds (in which case we persist dirty
bits always in child), and I didn't check whether it's intended, or why.
It'll have similar reasoning as above discussion on pte, or even more I
overlooked.
So again, the safest approach here is in terms of dirty bit we keep what we
do as before. And that's what this patch does as of now.
IOW, if I'll need a repost, I'll repost exactly the same thing (with the
fixup I sent later, which is already in mm-unstable).
Thanks,
--
Peter Xu
Powered by blists - more mailing lists