Message-Id: <51ccff8c-bb09-3dc4-4d75-bf1b86ca75a9@de.ibm.com>
Date: Thu, 1 Jun 2017 13:24:26 +0200
From: Christian Borntraeger <borntraeger@...ibm.com>
To: Martin Schwidefsky <schwidefsky@...ibm.com>,
David Hildenbrand <david@...hat.com>
Cc: kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Heiko Carstens <heiko.carstens@...ibm.com>,
Thomas Huth <thuth@...hat.com>
Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
On 06/01/2017 12:46 PM, Martin Schwidefsky wrote:
> Hi David,
>
> it is nice to see that you are still working on s390 related topics.
>
> On Mon, 29 May 2017 18:32:00 +0200
> David Hildenbrand <david@...hat.com> wrote:
>
>> Having to enable vm.alloc_pgste globally might not be the best solution.
>> 4k page tables are created for all processes and running QEMU KVM guests
>> is more complicated than it should be.
>
> To run KVM guests you need to issue a single sysctl to set vm.allocate_pgste,
> this is the best solution we found so far.
SUSE and Ubuntu seem to ship a sysctl.conf file in the qemu-kvm package that
flips this switch globally.
>
>> Unfortunately, converting all page tables to 4k pgste page tables is
>> not possible without provoking various race conditions.
>
> That is one approach we tried, and it was found to be buggy. The point is that
> you are not allowed to reallocate a page table while a VMA exists that is
> in the address range of that page table.
>
> Another approach we tried is to use an ELF flag on the qemu executable.
> That does not work either because fs/exec.c allocates and populates the
> new mm struct for the argument pages before fs/binfmt_elf.c comes into
> play.
>
>> However, we
>> might be able to let 2k and 4k page tables co-exist. We only need
>> 4k page tables whenever we want to expose such memory to a guest. So
>> turning on 4k page table allocation at one point and only allowing such
>> memory to go into our gmap (guest mapping) might be a solution.
>> User space tools like QEMU that create the VM before mmap-ing any memory
>> that will belong to the guest can simply use the new VM type. Proper 4k
>> page tables will be created for any memory mmap-ed afterwards. And these
>> can be used in the gmap without problems. Existing user space tools
>> will work as before - having to enable vm.alloc_pgste explicitly.
>
> I can not say that I like this approach. Right now a process either uses
> 2K page tables or 4K page tables. With your patch it is basically per page
> table page. Memory areas that existed before the switch to allocate
> 4K page tables can not be mapped to the guest's gmap anymore. There might
> be hidden pitfalls e.g. with guest migration.
>
>> This should play fine with vSIE, as vSIE code works completely on the gmap.
>> So if only page tables with pgste go into our gmap, we should be fine.
>>
>> Not sure if this breaks important concepts, has some serious performance
>> problems or I am missing important cases. If so, I guess there is really
>> no way to avoid setting vm.alloc_pgste.
>>
>> Possible modifications:
>> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead of
>> a new VM type
>> - Remember if we have mixed pgtables. If !mixed, we can maybe make faster
>> decisions (if that is really a problem).
>
> What I do not like in particular is this function:
>
> static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
> {
> 	struct page *page;
>
> 	if (!mm_has_pgste(mm))
> 		return 0;
>
> 	page = pfn_to_page(addr >> PAGE_SHIFT);
> 	return atomic_read(&page->_mapcount) & 0x4U;
> }
The good thing about this approach is that the first condition keeps non-KVM
processes as fast as before. In fact, given that the sysctl is now set everywhere,
this patch might actually move non-KVM processes back to 2k page tables, which
would improve them.
>
> The check for pgstes got more complicated, it used to be a test-under-mask
> of a bit in the mm struct and a branch. Now we have an additional pfn_to_page,
> an atomic_read and a bit test. That is done multiple times for every ptep_xxx
> operation.
>
> Is the operational simplification of not having to set vm.allocate_pgste really
> that important?
>
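To make the trade-off concrete, here is a rough user-space model of the two
checks discussed above. It is only a sketch: struct model_mm,
struct model_pgtable_page, MODEL_PGSTE_BIT and both helper functions are
made-up stand-ins, not kernel code, and the model leaves out the pfn_to_page
lookup and the atomic read that the real function performs. It only shows the
control flow: the old check is a single per-mm flag test, while the proposed
one tests the flag first and then looks at a bit in the page table page.

```c
#include <stdbool.h>

/* Hypothetical user-space stand-ins, not the kernel's real structures. */
struct model_mm {
	bool has_pgste;		/* models the per-mm flag behind mm_has_pgste() */
};

struct model_pgtable_page {
	unsigned int mapcount;	/* models page->_mapcount of a page table page */
};

/* Bit used in the quoted function to mark a page table that carries pgstes. */
#define MODEL_PGSTE_BIT 0x4U

/* Old scheme: the whole mm either uses pgste page tables or it does not. */
static bool old_has_pgste(const struct model_mm *mm)
{
	return mm->has_pgste;
}

/*
 * Proposed scheme: the mm flag is tested first, so a process that never
 * enabled pgstes returns right here without looking at the page table page.
 * Only for a pgste-enabled mm is the per-page-table bit consulted.
 */
static bool new_has_pgste(const struct model_mm *mm,
			  const struct model_pgtable_page *pt)
{
	if (!mm->has_pgste)
		return false;

	return (pt->mapcount & MODEL_PGSTE_BIT) != 0;
}
```

In this model a process with has_pgste == false pays exactly the same single
test under both schemes. The extra work only shows up for a KVM process, where
each call additionally has to inspect the page table page, and where a page
table allocated before the switch reports "no pgste" even though the mm flag
is set.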