linux-kernel - Re: kvm splat in mmu_spte_clear_track

Message-ID: <CA+55aFwq3M+asJz+1G1iU3pWLqWKnRD-7ufASERYA5vPZfVeLA@mail.gmail.com>
Date:   Tue, 29 Aug 2017 12:38:43 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        Adam Borowski <kilobyte@...band.pl>,
        Takashi Iwai <tiwai@...e.de>, Bernhard Held <berny156@....de>,
        Nadav Amit <nadav.amit@...il.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Radim Krčmář <rkrcmar@...hat.com>,
        Joerg Roedel <jroedel@...e.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kvm <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Michal Hocko <mhocko@...nel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 12:13 PM, Jerome Glisse <jglisse@...hat.com> wrote:
>
> Yes, and I am fine with page traversal being under spinlock and not
> being able to sleep during that. I agree doing otherwise would be
> insane. It is just that the existing behavior of try_to_unmap_one()
> and page_mkclean_one() has been broken and that no mmu_notifier
> calls were added around the lock section.

Yeah, I'm actually surprised that ever worked. I'm surprised that
try_to_unmap_one didn't hold any locks earlier.

In fact, I think at least some of them *did* already hold the page
table locks: ptep_clear_flush_young_notify() and friends very much
should have always held them.

So it's literally just that mmu_notifier_invalidate_page() call that
used to be outside all the locks, but honestly, I think that was
always a bug. It means that you got notified of the page removal
*after* the page was already gone and all locks had been released, so
a completely *different* page could already have been mapped to that
address.

So I think the old code was always broken exactly because the callback
wasn't serialized with the actual action.

> I sent a patch that properly computes the range to invalidate and moves
> to invalidate_range() but is lacking the invalidate_range_start()/
> end() so I am going to respin that with range_start/end bracketing and
> assume the worst for the range of addresses.

So surrounding it with start/end _should_ make KVM happy.

KVM people, can you confirm?

But I do note that there's a number of other users of that
"invalidate_page" callback.

I think ib_umem_notifier_invalidate_page() has the exact same blocking
issue, but changing to range_start/end should be good there too.

amdgpu_mn_invalidate_page() and the xen/gntdev also seem to be happy
being replaced with start/end.

In fact, I'm wondering if this actually means that we could get rid of
mmu_notifier_invalidate_page() entirely. There's only a couple of
callers, and the other one seems to be fs/dax.c, and it actually seems
to have the exact same issue that the try_to_unmap_one() code had: it
tried to invalidate an address too late - by the time it was called,
the page had already been cleaned and locks had been released.

So the more I look at that "turn mmu_notifier_invalidate_page() into
invalidate_range_start/end()" the more I think that's fundamentally
the right thing to do.
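
To make the ordering argument above concrete, here is a minimal
userspace sketch. It is not kernel code: the event names, struct trace,
unmap_old(), unmap_bracketed() and notified_before_clear() are all
hypothetical names invented for illustration. It only models the order
in which a secondary MMU would hear about a change relative to the
page table lock, assuming the start/end callbacks run outside the
spinlock (since they may block) and bracket the locked section.

```c
#include <assert.h>

/* Hypothetical event model; none of these names are kernel APIs. */
enum event {
    EV_RANGE_START,      /* models invalidate_range_start() */
    EV_LOCK,             /* page table lock taken */
    EV_CLEAR_PTE,        /* the mapping is actually removed */
    EV_UNLOCK,           /* page table lock dropped */
    EV_RANGE_END,        /* models invalidate_range_end() */
    EV_INVALIDATE_PAGE   /* models the single invalidate_page() call */
};

#define MAX_EVENTS 16

struct trace {
    enum event ev[MAX_EVENTS];
    int n;
};

void record(struct trace *t, enum event e)
{
    if (t->n < MAX_EVENTS)
        t->ev[t->n++] = e;
}

/* Old ordering: one notification, issued after the lock is dropped.
 * Between EV_UNLOCK and EV_INVALIDATE_PAGE a different page could
 * already be mapped at the same address. */
void unmap_old(struct trace *t)
{
    record(t, EV_LOCK);
    record(t, EV_CLEAR_PTE);
    record(t, EV_UNLOCK);
    record(t, EV_INVALIDATE_PAGE);
}

/* Bracketed ordering: the listener is told before the change begins
 * and again after it ends; both calls sit outside the lock. */
void unmap_bracketed(struct trace *t)
{
    record(t, EV_RANGE_START);
    record(t, EV_LOCK);
    record(t, EV_CLEAR_PTE);
    record(t, EV_UNLOCK);
    record(t, EV_RANGE_END);
}

/* Returns 1 if the listener was notified before the pte was cleared,
 * 0 if the clear happened first (or no start notification was seen). */
int notified_before_clear(const struct trace *t)
{
    for (int i = 0; i < t->n; i++) {
        if (t->ev[i] == EV_CLEAR_PTE)
            return 0;
        if (t->ev[i] == EV_RANGE_START)
            return 1;
    }
    return 0;
}
```

Running both functions on an empty trace and passing each trace to
notified_before_clear() shows the difference: in the old ordering the
listener learns of the change only after the pte is gone and the lock
has been released, while the bracketed ordering tells it up front and
confirms afterwards.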

                 Linus