Message-ID: <20191004123732.xpr3vroee5mhg2zt@box.shutemov.name>
Date:   Fri, 4 Oct 2019 15:37:32 +0300
From:   "Kirill A. Shutemov" <kirill@...temov.name>
To:     Thomas Hellström (VMware) 
        <thomas_os@...pmail.org>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        torvalds@...ux-foundation.org,
        Thomas Hellstrom <thellstrom@...are.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Matthew Wilcox <willy@...radead.org>,
        Will Deacon <will.deacon@....com>,
        Peter Zijlstra <peterz@...radead.org>,
        Rik van Riel <riel@...riel.com>,
        Minchan Kim <minchan@...nel.org>,
        Michal Hocko <mhocko@...e.com>,
        Huang Ying <ying.huang@...el.com>,
        Jérôme Glisse <jglisse@...hat.com>
Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the
 pagewalk code
On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
> > > + *   If @mapping allows faulting of huge pmds and puds, it is desirable
> > > + *   that its huge_fault() handler blocks while this function is running on
> > > + *   @mapping. Otherwise a race may occur where the huge entry is split when
> > > + *   it was intended to be handled in a huge entry callback. This requires an
> > > + *   external lock, for example that @mapping->i_mmap_rwsem is held in
> > > + *   write mode in the huge_fault() handlers.
> > Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
> > on read) to split PMD entry into PTE table. And it can happen not only
> > from fault path.
> > 
> > If you care about splitting compound page under you, take a pin or lock a
> > page. It will block split_huge_page().
> > 
> > Suggestion to block fault path is not viable (and it will not happen
> > magically just because of this comment).
> > 
> I was specifically thinking of this:
> 
> https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
> 
> If a huge pud is concurrently faulted in here, it will immediately get split
> without getting processed in pud_entry(). An external lock would protect
> against that, but that's perhaps a bug in the pagewalk code?  For pmds the
> situation is not the same since when pte_entry is used, all pmds will
> unconditionally get split.
I *think* it should be fixed with something like this (there's no
pud_trans_unstable() yet):
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index d48c2a986ea3..221a3b945f42 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 					break;
 				continue;
 			}
+		} else {
+			split_huge_pud(walk->vma, pud, addr);
 		}
 
-		split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
+		if (pud_none(*pud) || pud_trans_unstable(*pud))
 			goto again;
 
 		if (ops->pmd_entry || ops->pte_entry)
Or, better yet, convert it to what we do at the pmd level.

Honestly, all the code around PUD THP is missing a lot of groundwork.
Rushing it upstream for DAX was not the right move.
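
For illustration only, the ordering change in the diff above can be modelled in
userspace. This is a minimal sketch, not kernel code: the entry states, the
single-shot race_window() helper, the counters, and both walk functions are
hypothetical names made up for the model, and the "unstable" test simply stands
in for the pud_trans_unstable() helper that does not exist yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of one pud slot; none of this is kernel API. */
enum entry_state { ENTRY_NONE, ENTRY_HUGE, ENTRY_TABLE };

struct model_entry {
	enum entry_state state;
	bool racing_fault;	/* a simulated concurrent fault is still pending */
};

struct walk_stats {
	int holes, huge_callbacks, splits, descents, retries;
};

/* Simulated race window: a pending fault installs a huge entry exactly once. */
static void race_window(struct model_entry *e)
{
	if (e->racing_fault) {
		e->racing_fault = false;
		e->state = ENTRY_HUGE;
	}
}

/* Stand-in for split_huge_pud(): a huge entry becomes a table, else no-op. */
static void model_split(struct model_entry *e, struct walk_stats *s)
{
	if (e->state == ENTRY_HUGE) {
		e->state = ENTRY_TABLE;
		s->splits++;
	}
}

/* Pre-patch ordering: the split runs even when a huge callback exists. */
static void walk_old(struct model_entry *e, bool has_huge_cb,
		     struct walk_stats *s)
{
again:
	if (e->state == ENTRY_NONE) {
		s->holes++;
		return;
	}
	if (has_huge_cb && e->state == ENTRY_HUGE) {
		s->huge_callbacks++;
		return;
	}
	race_window(e);		/* huge entry appears after the check */
	model_split(e, s);	/* ...and is split without the callback */
	if (e->state == ENTRY_NONE) {
		s->retries++;
		goto again;
	}
	s->descents++;
}

/* Patched ordering: split only without a huge callback, retry if unstable. */
static void walk_new(struct model_entry *e, bool has_huge_cb,
		     struct walk_stats *s)
{
again:
	if (e->state == ENTRY_NONE) {
		s->holes++;
		return;
	}
	if (has_huge_cb) {
		if (e->state == ENTRY_HUGE) {
			s->huge_callbacks++;
			return;
		}
	} else {
		model_split(e, s);
	}
	race_window(e);
	/* Model of "pud_none() || pud_trans_unstable()": recheck and retry. */
	if (e->state == ENTRY_NONE || e->state == ENTRY_HUGE) {
		s->retries++;
		goto again;
	}
	assert(e->state == ENTRY_TABLE);
	s->descents++;
}
```

Starting both walkers on a table entry with a pending simulated fault and a
huge callback registered, the old ordering splits the freshly installed huge
entry and descends without ever running the callback, while the patched
ordering retries once and hands the huge entry to the callback. The model has
no locking at all, so it says nothing about ptl; it only shows the control
flow.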
> There's a similar more scary race in
> 
> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
> 
> It looks like if a concurrent thread faults in a huge pud just after the
> test for pud_none in that pmd_alloc, things might go pretty bad.
Hm? It will fail the next pmd_none() check under ptl. Do you have a
particular racing scenario in mind?
-- 
 Kirill A. Shutemov