Message-ID: <b45c99c9-02f9-7e97-7a98-71388a2a1bbd@amd.com>
Date: Thu, 1 Jul 2021 17:31:15 -0400
From: Felix Kuehling <felix.kuehling@....com>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
Dave Airlie <airlied@...il.com>,
Philip Yang <Philip.Yang@....com>,
Alex Deucher <alexander.deucher@....com>
Cc: Daniel Vetter <daniel.vetter@...ll.ch>,
dri-devel <dri-devel@...ts.freedesktop.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [git pull] drm for 5.14-rc1
On 2021-07-01 at 4:15 p.m., Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:34 PM Dave Airlie <airlied@...il.com> wrote:
>> Hi Linus,
>>
>> This is the main drm pull request for 5.14-rc1.
>>
>> I've done a test pull into your current tree, and hit two conflicts
>> (one in vc4, one in amdgpu), both seem pretty trivial, the amdgpu one
>> is recent and sfr sent out a resolution for it today.
> Well, the resolutions may be trivial, but the conflict made me look at
> the code, and it's buggy.
>
> Commit 04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function")
> is broken. It made the code do
>
> mmap_read_lock(mm);
> vma = find_vma(mm, start);
> mmap_read_unlock(mm);
>
> and then it *uses* that "vma" after it has dropped the lock.
>
> That's a big no-no - once you've dropped the lock, the vma contents
> simply aren't reliable any more. That mapping could now be unmapped
> and removed at any time.
>
> Now, the conflict actually made one of the uses go away (switching to
> vma_lookup() means that the subsequent code no longer needs to look at
> "vm_start" to verify we're actually _inside_ the vma), but it still
> checks for vma->vm_file afterwards.
>
> So those locking changes in commit 04d8d73dbcbe are completely bogus.
>
> I tried to fix up that bug while handling the conflict, but who knows
> what else similar is going on elsewhere.
>
> So I would ask people to
>
> (a) verify that I didn't make things worse as I fixed things up (note
> how I had to change the last argument to amdgpu_hmm_range_get_pages()
> from false to true etc).
>
> (b) go and look at their vma lookup code: you can't just look up a
> vma under the lock, and then drop the lock, and then think things stay
> stable.
>
> In particular for that (b) case: it is *NOT* enough to look up
> vma->vm_file inside the lock and cache that. No - if the test is about
> "no backing file before looking up pages", then you have to *keep*
> holding the lock until after you've actually looked up the pages!
>
> Because otherwise any test for "vma->vm_file" is entirely pointless,
> for the same reason it's buggy to even look at it after dropping the
> lock: because once you've dropped the lock, the thing you just tested
> for might not be true any more.
>
> So no, it's not valid to do
>
> bool has_file = vma && vma->vm_file;
>
> and then drop the lock, because you don't use 'vma' any more as a
> pointer, and then use 'has_file' outside the lock. Because after
> you've dropped the lock, 'has_file' is now meaningless.
>
> So it's not just about "you can't look at vma->vm_file after dropping
> the lock". It's more fundamental than that. Any *decision* you make
> based on the vma is entirely pointless and moot after the lock is
> dropped!
>
> Did I fix it up correctly? Who knows. The code makes more sense to me
> now and seems valid. But I really *really* want to stress how locking
> is important.
Thank you for the fix and the explanation. Your fix looks correct. I
also double-checked all other uses of find_vma in the amdgpu driver.
They all hold the mmap lock correctly.
Two comments:

1. With this fix, we could remove the bool mmap_locked parameter from
   amdgpu_hmm_range_get_pages because it always gets called with the lock
   held now.

2. You're now holding the mmap lock from the vma_lookup until
   hmm_range_fault is done. This ensures that the result of the
   vma->vm_file check remains valid. This was broken even before our commit
   04d8d73dbcbe ("drm/amdgpu: add common HMM get pages function").
>
> You also can't just unlock in the middle of an operation - even if you
> then take the lock *again* later (as amdgpu_hmm_range_get_pages() then
> did), the fact that you unlocked in the middle means that all the
> earlier tests you did are simply no longer valid when you re-take the
> lock.
I agree completely. I catch a lot of locking bugs in code review. I
probably missed this one because I wasn't paying enough attention to
what was being protected by the mmap_read_lock in this case.
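
The unlock-and-relock case can be sketched in userspace as well. This is an illustration under assumptions, not the driver code: struct mapping, two_phase_op() and simulate_unmap() are hypothetical names, and the hook only simulates another thread changing the mapping while the lock is dropped. If the lock really has to be released in the middle, the earlier result has to be thrown away and every test redone after re-taking the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mapping state protected by a mutex. */
struct mapping {
    pthread_mutex_t lock;
    bool mapped;
    bool has_file;
};

/* Runs while the lock is dropped; stands in for a concurrent thread. */
typedef void (*unlocked_hook)(struct mapping *);

static void mapping_init(struct mapping *m, bool mapped, bool has_file)
{
    pthread_mutex_init(&m->lock, NULL);
    m->mapped = mapped;
    m->has_file = has_file;
}

/* Caller must hold m->lock. */
static bool is_usable_locked(const struct mapping *m)
{
    return m->mapped && !m->has_file;
}

/* Simulates an munmap() racing with us in the unlocked window. */
static void simulate_unmap(struct mapping *m)
{
    pthread_mutex_lock(&m->lock);
    m->mapped = false;
    pthread_mutex_unlock(&m->lock);
}

/* Returns 0 on success, -1 if the mapping is not usable. */
static int two_phase_op(struct mapping *m, unlocked_hook hook)
{
    pthread_mutex_lock(&m->lock);
    bool ok = is_usable_locked(m);
    pthread_mutex_unlock(&m->lock);
    if (!ok)
        return -1;

    /* Lock dropped: 'ok' is stale from here on. */
    if (hook)
        hook(m);

    pthread_mutex_lock(&m->lock);
    if (!is_usable_locked(m)) {        /* redo the test, do not trust 'ok' */
        pthread_mutex_unlock(&m->lock);
        return -1;
    }
    assert(m->mapped);                 /* holds because we re-checked */
    pthread_mutex_unlock(&m->lock);
    return 0;
}
```

Calling two_phase_op(&m, simulate_unmap) fails only because of the second check; without it the function would proceed on a mapping that no longer exists.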
Regards,
Felix
>
> Linus