[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <211bcb36-5834-4969-90c4-73e26431d463@redhat.com>
Date: Tue, 4 Feb 2025 11:58:58 +0100
From: David Hildenbrand <david@...hat.com>
To: Alistair Popple <apopple@...dia.com>, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linux-mm@...ck.org, nouveau@...ts.freedesktop.org,
Andrew Morton <akpm@...ux-foundation.org>, Jérôme Glisse
<jglisse@...hat.com>, Jonathan Corbet <corbet@....net>,
Alex Shi <alexs@...nel.org>, Yanteng Si <si.yanteng@...ux.dev>,
Karol Herbst <kherbst@...hat.com>, Lyude Paul <lyude@...hat.com>,
Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, "Liam R. Howlett"
<Liam.Howlett@...cle.com>, Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Pasha Tatashin <pasha.tatashin@...een.com>, Peter Xu <peterx@...hat.com>,
Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH v1 05/12] mm/memory: detect writability in
restore_exclusive_pte() through can_change_pte_writable()
On 31.01.25 18:05, Simona Vetter wrote:
> On Fri, Jan 31, 2025 at 11:55:55AM +0100, David Hildenbrand wrote:
>> On 31.01.25 00:06, Alistair Popple wrote:
>>> On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
>>>> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
>>>>> On 30.01.25 10:51, Simona Vetter wrote:
>>>>>> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>>>>>>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>>>>>>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>>>>>>> can_change_pte_writable().
>>>>>>>
>>>>>>> Set the PTE only dirty if the folio is dirty: we might not
>>>>>>> necessarily have a write access, and setting the PTE writable doesn't
>>>>>>> require setting the PTE dirty.
>>>>>>
>>>>>> Not sure whether there's much difference in practice, since a device
>>>>>> exclusive access means a write, so the folio better be dirty (unless we
>>>>>> aborted halfway through). But then I couldn't find the code in nouveau to
>>>>>> do that, so now I'm confused.
>>>>>
>>>>> That confused me as well. Requiring the PTE to be writable does not imply
>>>>> that it is dirty.
>>>>>
>>>>> So something must either set the PTE or the folio dirty.
>>>>
>>>> Yeah I'm not finding that something.
>>>>
>>>>> ( In practice, most anonymous folios are dirty most of the time ... )
>>>>
>>>> And yup that's why I think it hasn't blown up yet.
>>>>
>>>>> If we assume that "device-exclusive entries" are always dirty, then it
>>>>> doesn't make sense to set the folio dirty when creating device-exclusive
>>>>> entries. We'd always have to set the PTE dirty when restoring the exclusive
>>>>> pte.
>>>>
>>>> I do agree with your change, I think it's correct to put this
>>>> responsibility onto drivers. It's just that nouveau seems to not be
>>>> entirely correct.
>>>
>>> Yeah, agree it should be a driver responsibility but also can't see how nouveau
>>> is correct there either. I might see if I can get it to blow up...
>>
>> (in context of the rmap walkers) The question is, how do we consider
>> device-exclusive entries:
>>
>> (1) dirty? Not from a CPU perspective.
>> (2) referenced? Not from a CPU perspective.
>>
>> If the answer is always "no" to all questions, then memory notifiers must
>> handle it, because we'd be answering the question from the CPU point of
>> view.
>>
>> If the answer is always "yes", there is a problem: we can only make it
>> clean/young by converting it to an ordinary PTE first (requiring MMU
>> notifiers etc.), which makes it quite nasty.
>>
>> Mixed answers are not possible, because we don't know just from staring at
>> the entry.
>
> I think it's the gpu's (or whatever is using it) responsibility to update
> folio state while it has ptes pointing at memory. Whether that's
> device-exclusive system memory or device-private migrated memory. Anything
> else doesn't make sense to me conceptually.
>
> And I don't think we can just blindly assume even for device-exclusive
> mappings that they will be dirty when we convert them back to a real pte,
> because we might have raced trying to set up the gpu mapping and restarted
> before we even put the pte into place. Or maybe someone was real quick at
> writing it back after the gpu already dropped it's pte.
>
> I guess maybe some clear documentation in all these functions
> (make_device_exclusive, hmm_range_fault, migration helpers) that it's the
> drivers job to dirty pages correctly would help?
I'll add a comment to make_device_exclusive(), stating that these
entries are considered clean+old from a MM perspective, and that the
driver must update the folio when notified by MMU notifiers.
We should probably document that somewhere else as well as you suggested
separately.
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists