[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <20210827110651.1950-1-sjpark@amazon.de>
Date: Fri, 27 Aug 2021 11:06:51 +0000
From: SeongJae Park <sj38.park@...il.com>
To: David Hildenbrand <david@...hat.com>
Cc: SeongJae Park <sj38.park@...il.com>, akpm@...ux-foundation.org,
markubo@...zon.com, SeongJae Park <sjpark@...zon.de>,
Jonathan.Cameron@...wei.com, acme@...nel.org,
alexander.shishkin@...ux.intel.com, amit@...nel.org,
benh@...nel.crashing.org, brendanhiggins@...gle.com,
corbet@....net, dwmw@...zon.com, elver@...gle.com,
fan.du@...el.com, foersleo@...zon.de, greg@...ah.com,
gthelen@...gle.com, guoju.fgj@...baba-inc.com, jgowans@...zon.com,
joe@...ches.com, mgorman@...e.de, mheyne@...zon.de,
minchan@...nel.org, mingo@...hat.com, namhyung@...nel.org,
peterz@...radead.org, riel@...riel.com, rientjes@...gle.com,
rostedt@...dmis.org, rppt@...nel.org, shakeelb@...gle.com,
shuah@...nel.org, sieberf@...zon.com, snu@...le79.org,
vbabka@...e.cz, vdavydov.dev@...il.com, zgf574564920@...il.com,
linux-damon@...zon.com, linux-mm@...ck.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v34 05/13] mm/damon: Implement primitives for the virtual memory address spaces
From: SeongJae Park <sjpark@...zon.de>
On Thu, 26 Aug 2021 23:42:19 +0200 David Hildenbrand <david@...hat.com> wrote:
> On 26.08.21 19:29, SeongJae Park wrote:
> > From: SeongJae Park <sjpark@...zon.de>
> >
> > Hello David,
> >
> >
> > On Thu, 26 Aug 2021 16:09:23 +0200 David Hildenbrand <david@...hat.com> wrote:
> >
> >>> +static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
> >>> +{
> >>> + pte_t *pte = NULL;
> >>> + pmd_t *pmd = NULL;
> >>> + spinlock_t *ptl;
> >>> +
> >>
> >> I just stumbled over this, sorry for the dumb questions:
> >
> > Appreciate for the great questions!
> >
> >>
> >>
> >> a) What do we know about that region we are messing with?
> >>
> >> AFAIU, just like follow_pte() and follow_pfn(), follow_invalidate_pte()
> >> should only be called on VM_IO and raw VM_PFNMAP mappings in general
> >> (see the doc of follow_pte()). Do you even know that it's within a
> >> single VMA and that there are no concurrent modifications?
> >
> > We have no idea about the region at this moment. However, if we successfully
> > get the pte or pmd under the protection of the page table lock, we ensure the
> > page for the pte or pmd is a online LRU-page with damon_get_page(), before
> > updating the pte or pmd's PAGE_ACCESSED bit. We release the page table lock
> > only after the update.
> >
> > And concurrent VMA change doesn't matter here because we read and write only
> > the page table. If the address is not mapped or not backed by LRU pages, we
> > simply treat it as not accessed.
>
> reading/writing page tables is the real problem.
>
> >
> >>
> >> b) Which locks are we holding?
> >>
> >> I hope we're holding the mmap lock in read mode at least. Or how are you
> >> making sure there are no concurrent modifications to page tables / VMA
> >> layout ... ?
> >>
> >>> + if (follow_invalidate_pte(mm, addr, NULL, &pte, &pmd, &ptl))
> >
> > All the operations are protected by the page table lock of the pte or pmd, so
> > no concurrent page table modification would happen. As previously mentioned,
> > because we read and update only page table, we don't care about VMAs and
> > therefore we don't need to hold mmap lock here.
>
> See below, that's unfortunately not sufficient.
>
> >
> > Outside of this function, DAMON reads the VMAs to know which address ranges are
> > not mapped, and avoid inefficiently checking access to the area with the
> > information. Nevertheless, it happens only occasionally (once per 60 seconds
> > by default), and it holds the mmap read lock in the case.
> >
> > Nonetheless, I agree the usage of follow_invalidate_pte() here could make
> > readers very confusing. It would be better to implement and use DAMON's own
> > page table walk logic. Of course, I might missing something important. If you
> > think so, please don't hesitate at yelling to me.
>
>
> I'm certainly not going to yell :) But unfortunately I'll have to tell
> you that what you are doing is in my understanding fundamentally broken.
>
> See, page tables might get removed any time
> a) By munmap() code even while holding the mmap semaphore in read (!)
> b) By khugepaged holding the mmap lock in write mode
>
> The rules are (ignoring the rmap side of things)
>
> a) You can walk page tables inside a known VMA with the mmap semaphore
> held in read mode. If you drop the mmap sem, you have to re-validate the
> VMA! Anything could have changed in the meantime. This is essentially
> what mm/pagewalk.c does.
>
> b) You can walk page tables ignoring VMAs with the mmap semaphore held
> in write mode.
>
> c) You can walk page tables lockless if the architecture supports it and
> you have interrupts disabled the hole time. But you are not allowed to
> write.
>
> With what you're doing, you might end up reading random garbage as page
> table pointers, or writing random garbage to pages that are no longer
> used as page tables.
>
> Take a look at mm/gup.c:lockless_pages_from_mm() to see how difficult it
> is to walk page tables lockless. And it only works because page table
> freeing code synchronizes either via IPI or fake-rcu before actually
> freeing a page table.
>
> follow_invalidate_pte() is, in general, the wrong thing to use. It's
> specialized to VM_IO and VM_PFNMAP. take a look at the difference in
> complexity between follow_invalidate_pte() and mm/pagewalk.c!
>
> I'm really sorry, but as far as I can tell, this is locking-wise broken
> and follow_invalidate_pte() is the wrong interface to use here.
>
> Someone can most certainly correct me if I'm wrong, or if I'm missing
> something regarding your implementation, but if you take a look around,
> you won't find any code walking page tables without at least holding the
> mmap sem in read mode -- for a good reason.
Thank you very much for this kind explanation, David! I will send a patch for
this soon.
Thanks,
SJ
>
> --
> Thanks,
>
> David / dhildenb
>
Powered by blists - more mailing lists