[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <463b86c8-e9a0-fc13-efa4-31df3aea8e54@oracle.com>
Date: Mon, 13 May 2019 18:47:09 +0200
From: Alexandre Chartre <alexandre.chartre@...cle.com>
To: Dave Hansen <dave.hansen@...el.com>, pbonzini@...hat.com,
rkrcmar@...hat.com, tglx@...utronix.de, mingo@...hat.com,
bp@...en8.de, hpa@...or.com, dave.hansen@...ux.intel.com,
luto@...nel.org, peterz@...radead.org, kvm@...r.kernel.org,
x86@...nel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc: konrad.wilk@...cle.com, jan.setjeeilers@...cle.com,
liran.alon@...cle.com, jwadams@...gle.com
Subject: Re: [RFC KVM 19/27] kvm/isolation: initialize the KVM page table with
core mappings
On 5/13/19 5:50 PM, Dave Hansen wrote:
>> + /*
>> + * Copy the mapping for all the kernel text. We copy at the PMD
>> + * level since the PUD is shared with the module mapping space.
>> + */
>> + rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
>> + PGT_LEVEL_PMD);
>> + if (rv)
>> + goto out_uninit_page_table;
>
> Could you double-check this? We (I) have had some repeated confusion
> with the PTI code and kernel text vs. kernel data vs. __init.
> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
> kernel text.
I probably have the same confusion :-) but I will try to check again.
>> + /*
>> + * Copy the mapping for cpu_entry_area and %esp fixup stacks
>> + * (this is based on the PTI userland address space, but probably
>> + * not needed because the KVM address space is not directly
>> + * enterered from userspace). They can both be copied at the P4D
>> + * level since they each have a dedicated P4D entry.
>> + */
>> + rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
>> + PGT_LEVEL_P4D);
>> + if (rv)
>> + goto out_uninit_page_table;
>
> cpu_entry_area is used for more than just entry from userspace. The gdt
> mapping, for instance, is needed everywhere. You might want to go look
> at 'struct cpu_entry_area' in some more detail.
Ok. Thanks.
>> +#ifdef CONFIG_X86_ESPFIX64
>> + rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
>> + PGT_LEVEL_P4D);
>> + if (rv)
>> + goto out_uninit_page_table;
>> +#endif
>
> Why are these mappings *needed*? I thought we only actually used these
> fixup stacks for some crazy iret-to-userspace handling. We're certainly
> not doing that from KVM context.
Right. I initially looked what was used for PTI, and I probably copied unneeded
mapping.
> Am I forgetting something?
>
>> +#ifdef CONFIG_VMAP_STACK
>> + /*
>> + * Interrupt stacks are vmap'ed with guard pages, so we need to
>> + * copy mappings.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + stack = per_cpu(hardirq_stack_ptr, cpu);
>> + pr_debug("IRQ Stack %px\n", stack);
>> + if (!stack)
>> + continue;
>> + rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
>> + if (rv)
>> + goto out_uninit_page_table;
>> + }
>> +
>> +#endif
>
> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
> the switching in there so that we never even *get* interrupts in the KVM
> context?
Ideally we would like to run with the KVM address space when handling a VM-exit
(so between a VMEXIT and the next VMENTER) where interrupts are not disabled.
> I also share Peter's concerns about letting modules do this. If we ever
> go down this road, we're going to have to think very carefully how we
> let KVM do this without giving all the not-so-nice out-of-tree modules
> the keys to the castle.
Right, we probably need some more generic framework for creating limited
kernel context space which kvm (or other module?) can deal with. I think
kvm is a good place to start for having this kind of limited context, hence
this RFC and my request for advice how best to do it.
> A high-level comment: it looks like this is "working", but has probably
> erred on the side of mapping too much. The hard part is paring this
> back to a truly minimal set of mappings.
>
Agree.
Thanks,
alex.
Powered by blists - more mailing lists