[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.21.1809131622180.1473@nanos.tec.linutronix.de>
Date: Thu, 13 Sep 2018 18:22:17 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: Brijesh Singh <brijesh.singh@....com>
cc: x86@...nel.org, linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
Tom Lendacky <thomas.lendacky@....com>,
Borislav Petkov <bp@...e.de>,
Paolo Bonzini <pbonzini@...hat.com>,
Sean Christopherson <sean.j.christopherson@...el.com>,
Radim Krčmář <rkrcmar@...hat.com>
Subject: Re: [PATCH v7 0/5] x86: Fix SEV guest regression
On Mon, 10 Sep 2018, Brijesh Singh wrote:
> x86/kvmclock: Remove memblock dependency
> introduced SEV guest regression.
>
> The guest physical address holding the wall_clock and hv_clock_boot
> are shared with the hypervisor must be mapped with C=0 when SEV
> is active. To clear the C-bit we use kernel_physical_mapping_init() to
> split the large pages. The above commit moved the kvmclock initialization
> very early and kernel_physical_mapping_init() fails to allocate memory
> while spliting the large page.
>
> To solve it, we add a special .data..decrypted section, this section can be
> used to hold the shared variables. Early boot code maps this section with
> C=0. The section is pmd aligned and sized to avoid the need to split the pages.
> Caller can use __decrypted attribute to add the variables in .data..decrypted
> section.
I went through this patch series and to be honest this is in no way -rc3+
material.
I really have to ask why this whole change to sme_encrypt_kernel() is
required at all.
Lets look at the problem at hand:
We need exactly TWO pages to be shareable at early boot when the
kernel runs in a guest and wants to utilize kvmclock.
No, we do not need the whole NR_CPUS sized thing at that point at
all. The secondary CPUs are brought up way later and there is absolutely
no point in having the ugly decrypted..aux hack. That conditional freeing
is really horrible and prone to break sooner than later.
Further the TWO pages are BSS pages which does not require to handle them
special other than fixing up the PMD and doing a memset(0) on the
actually utilized TWO pages.
We really can enforce that such decrypted variables have to be in the BSS
for now and revisit this when there is an absolute requirement to have
statically initialized ones, which I doubt we have.
So the way simpler solution is:
1) Add a .bss..decrypted section which is PMD aligned and mark the
kvmclock hv_clock_boot and wallclock struct with __bss_decrypted.
2) Fixup the .bss..decrypted section PMD in the SEV case at the end of
sme_encrypt_kernel() and do a memset(0) on that. That does not require
any of the restructuring, really.
3) Have a function which is called after the page allocator is up which
does:
static struct pvclock_vsyscall_time_info *hvclock_mem;
void __init kvmclock_init_mem(void)
{
unsigned int ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
unsigned int order = get_order(ncpus * sizeof(*hvclock_mem));
struct page *p;
p = alloc_pages(order, GFP_KERNEL);
if (p) {
hvclock_mem = page_address(p);
set_memory_decrypted((unsigned long) hvclock_mem,
PAGE_SIZE << order);
}
}
4) In kvmclock_setup_percpu() do the following:
if (cpu < HVC_BOOT_ARRAY_SIZE)
p = &hv_clock_boot[cpu];
else if (hvclock_mem)
p = hvclock_mem + cpu - HVC_BOOT_ARRAY_SIZE;
else
return -ENOMEM;
5) Unconditionally free the pages after the used .bbs..decrypted variables,
so the unused rest of the 2M page is not lost.
That should be a halfways slim sized and non instrusive changeset.
Thanks,
tglx
Powered by blists - more mailing lists