lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFxrzKt2NpgERtg+QwxFPHZ1vMFNxyWEtrQR2-Hj3-xxLg@mail.gmail.com>
Date:   Tue, 29 Aug 2017 12:06:42 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     Andrea Arcangeli <aarcange@...hat.com>,
        Adam Borowski <kilobyte@...band.pl>,
        Takashi Iwai <tiwai@...e.de>, Bernhard Held <berny156@....de>,
        Nadav Amit <nadav.amit@...il.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <kernellwp@...il.com>,
        Radim Krčmář <rkrcmar@...hat.com>,
        Joerg Roedel <jroedel@...e.de>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        kvm <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Michal Hocko <mhocko@...nel.org>
Subject: Re: kvm splat in mmu_spte_clear_track_bits

On Tue, Aug 29, 2017 at 11:34 AM, Jerome Glisse <jglisse@...hat.com> wrote:
>
> Kirill did regress invalidate_page as it use to be call outside the
> spinlock and now it is call inside the spinlock thus reverting will
> introduce back a regression.

Honestly, this MMU notifier thing has been nothing but a badly
designed mistake from beginning to end, and bad rules for what can
sleep and what can not are one fundamental problem.

There are fundamentally two levels of VM locking, and those two levels
are not going to go away, and we're not budging on them:

 - there's the "virtual address" level, which can block. We have a
nice mmap_semaphore, and we guarantee that it's held for writing for
all changes to the virtual memory layout

   This is the "mmap/munmap" kind of granularity. The mmu callbacks at
*this* level are fine to block.

 - then there is the "page level" VM handling, and honestly, that
*fundamentally* uses a spinlock. If we look at a particular page, that
page is meaningless without the lock. Really.

   I honestly believe that any MMU callback at this level needs to be
atomic. Some of the absolutely *have* to be (that "change_pte", for
example).

In that second case, we might have a "begin/end" surrounding the
actual page table walk. And that might sleep, but then it
*fundamentally*  cannot actually be able some particular single page
or stable range. Because without the page table spinlock, no such
stability exists. It's purely a "we are not going to start looking at
this range" kind of thing.

I really don't understand why the nVidia crap cannot follow those
simple rules. Because either

 (a) you're working with virtual addresses, and you should be able to
work on that virtual layer

 (b) you're actually working with physical pages, and you can just
hold on to those physical pages yourself.

I really detest our MMU callbacks. I shouldn't have allowed them to be
merged. And I definitely shoul.dn't allow them to screw up our VM
layer.

But we have them, and we should work at making sure people do sane things.

And yes, those sane things may include

 (a) guaranteeing that the start/end range calls are always done
around the actual locked region.

 (b) adding a ton of validation so that people *see* then they break
the rules. Even when they don't use some random notifier crud.

That (b) may involve adding a number of "might_sleep()" calls (not
deep in the notifiers themselves, but in the actual wrapper functions
even when notifiers are compiled out entirely!), but also adding calls
to validate those kinds of "you can't call
mmu_notifier_invalidate_page() without having first called
mmu_notifier_invalidate_range_start() in a sleepable context".

But (b) definitely should also be a very real onus on the mmu
notifiers themselves. No way can we sleep when we're traversing page
tables. We hold a page table lock. We can sleep before and after, but
not during actual page traversal.

            Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ