Message-ID: <8ef9fff3-df8d-cc14-35f9-d83db62e874f@shipmail.org>
Date: Fri, 4 Oct 2019 14:58:59 +0200
From: Thomas Hellström (VMware)
<thomas_os@...pmail.org>
To: "Kirill A. Shutemov" <kirill@...temov.name>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org,
Thomas Hellstrom <thellstrom@...are.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Will Deacon <will.deacon@....com>,
Peter Zijlstra <peterz@...radead.org>,
Rik van Riel <riel@...riel.com>,
Minchan Kim <minchan@...nel.org>,
Michal Hocko <mhocko@...e.com>,
Huang Ying <ying.huang@...el.com>,
Jérôme Glisse <jglisse@...hat.com>
Subject: Re: [PATCH v3 2/7] mm: Add a walk_page_mapping() function to the
pagewalk code
On 10/4/19 2:37 PM, Kirill A. Shutemov wrote:
> On Thu, Oct 03, 2019 at 01:32:45PM +0200, Thomas Hellström (VMware) wrote:
>>>> + * If @mapping allows faulting of huge pmds and puds, it is desirable
>>>> + * that its huge_fault() handler blocks while this function is running on
>>>> + * @mapping. Otherwise a race may occur where the huge entry is split when
>>>> + * it was intended to be handled in a huge entry callback. This requires an
>>>> + * external lock, for example that @mapping->i_mmap_rwsem is held in
>>>> + * write mode in the huge_fault() handlers.
>>> Em. No. We have ptl for this. It's the only lock required (plus mmap_sem
>>> on read) to split PMD entry into PTE table. And it can happen not only
>>> from fault path.
>>>
>>> If you care about splitting compound page under you, take a pin or lock a
>>> page. It will block split_huge_page().
>>>
>>> Suggestion to block fault path is not viable (and it will not happen
>>> magically just because of this comment).
>>>
>> I was specifically thinking of this:
>>
>> https://elixir.bootlin.com/linux/latest/source/mm/pagewalk.c#L103
>>
>> If a huge pud is concurrently faulted in here, it will immediately get split
>> without getting processed in pud_entry(). An external lock would protect
>> against that, but that's perhaps a bug in the pagewalk code? For pmds the
>> situation is not the same since when pte_entry is used, all pmds will
>> unconditionally get split.
> I *think* it should be fixed with something like this (there's no
> pud_trans_unstable() yet):
>
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index d48c2a986ea3..221a3b945f42 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -102,10 +102,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
>  					break;
>  				continue;
>  			}
> +		} else {
> +			split_huge_pud(walk->vma, pud, addr);
>  		}
>
> -		split_huge_pud(walk->vma, pud, addr);
> -		if (pud_none(*pud))
> +		if (pud_none(*pud) || pud_trans_unstable(*pud))
>  			goto again;
>
>  		if (ops->pmd_entry || ops->pte_entry)

Yes, this seems better. I was looking at implementing a
pud_trans_unstable() as a basis of fixing problems like this, but when I
looked at pmd_trans_unstable() I got a bit confused:

Why are devmap huge pmds considered stable? I mean, couldn't anybody
just run madvise() to clear those just like transhuge pmds?

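For readers following the diff quoted above, here is a rough user-space
model of the two control flows. It is only a sketch, not kernel code: the
enum, the struct and the functions walk_old() and walk_new() are
hypothetical stand-ins, the "racer" flag simulates another thread
installing a huge pud inside the window discussed above, and the
unstable check mirrors the pud_trans_unstable() helper that does not
exist yet. It assumes ops->pud_entry is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins for the states a pud can be in. */
enum pud_state { PUD_NONE, PUD_HUGE, PUD_TABLE };

struct walk_result {
	int huge_callbacks;	/* huge entry was handed to pud_entry() */
	int splits;		/* huge entry was split instead */
};

/* Model of the current flow: split unconditionally after the lock attempt. */
struct walk_result walk_old(enum pud_state pud, bool racer_armed)
{
	struct walk_result res = { 0, 0 };

	if (pud == PUD_NONE)
		return res;			/* pte_hole() path */
	if (pud == PUD_HUGE) {
		res.huge_callbacks++;		/* pud_entry() under ptl */
		return res;
	}
	if (racer_armed)
		pud = PUD_HUGE;			/* simulated concurrent huge fault */
	if (pud == PUD_HUGE) {
		pud = PUD_TABLE;		/* split_huge_pud() */
		res.splits++;
	}
	return res;
}

/* Model of the proposed flow: no split when pud_entry exists, retry instead. */
struct walk_result walk_new(enum pud_state pud, bool racer_armed)
{
	struct walk_result res = { 0, 0 };

again:
	if (pud == PUD_NONE)
		return res;
	if (pud == PUD_HUGE) {
		res.huge_callbacks++;
		return res;
	}
	if (racer_armed) {
		pud = PUD_HUGE;			/* simulated concurrent huge fault */
		racer_armed = false;
	}
	if (pud == PUD_NONE || pud == PUD_HUGE)	/* models pud_trans_unstable() */
		goto again;
	return res;				/* stable table: descend to pmds */
}
```

In this model the old flow splits the racing huge pud without ever
calling the huge callback, while the new flow loops back and hands it to
the callback.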
>
> Or better yet converted to what we do on pmd level.
>
> Honestly, all the code around PUD THP is missing a lot of groundwork.
> Rushing it upstream for DAX was not the right move.
>
>> There's a similar more scary race in
>>
>> https://elixir.bootlin.com/linux/latest/source/mm/memory.c#L3931
>>
>> It looks like if a concurrent thread faults in a huge pud just after the
>> test for pud_none in that pmd_alloc, things might go pretty bad.
> Hm? It will fail the next pmd_none() check under ptl. Do you have a
> particular racing scenario?
>
Yes, I misinterpreted the code somewhat, but here's the scenario that
looks racy:

Thread 1                                Thread 2
huge_fault(pud) - Fell back, for
example because of write fault on
dirty-tracking.
                                        huge_fault(pud) - Taken, read fault.
pmd_alloc() - Will fail pmd_none
check and return a pmd_offset()
into thread 2's THP.

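The interleaving above can be replayed deterministically in user space.
This is a minimal sketch of the scenario as described, not a statement
about what the kernel actually does: struct pud_model and both functions
are hypothetical names, and the "misread" result simply means that the
caller would go on to treat a huge entry as if it pointed to a pmd
table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what a pud entry currently holds. */
enum pud_kind { PUD_EMPTY, PUD_HUGE_PAGE, PUD_PMD_TABLE };

struct pud_model {
	enum pud_kind kind;
};

/*
 * Model of pmd_alloc(): populate the entry only if it is empty, then
 * index into it as a pmd table. Returns true if the entry is in fact a
 * huge page, i.e. the caller would misread it as a table.
 */
bool pmd_alloc_misreads_huge(struct pud_model *pud)
{
	if (pud->kind == PUD_EMPTY)
		pud->kind = PUD_PMD_TABLE;
	return pud->kind == PUD_HUGE_PAGE;
}

/* Replay the two-thread scenario in a fixed order. */
bool run_interleaving(bool thread2_faults_between)
{
	struct pud_model pud = { PUD_EMPTY };

	/* Thread 1: huge_fault(pud) falls back, entry stays empty. */

	/* Thread 2: huge_fault(pud) is taken and installs a huge pud. */
	if (thread2_faults_between)
		pud.kind = PUD_HUGE_PAGE;

	/* Thread 1: continues with pmd_alloc(). */
	return pmd_alloc_misreads_huge(&pud);
}
```

Without thread 2 in the window, the model allocates a table and nothing
is misread; with thread 2 in the window, thread 1 ends up with an offset
into the huge entry.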
Thanks,
Thomas