linux-kernel - Re: [RFC KVM 18/27] kvm/isolation: function to copy page table entries for percpu buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e5fedad9-4607-0aa4-297e-398c0e34ae2b@oracle.com>
Date:   Tue, 14 May 2019 18:24:48 +0200
From:   Alexandre Chartre <alexandre.chartre@...cle.com>
To:     Andy Lutomirski <luto@...nel.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Radim Krcmar <rkrcmar@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        kvm list <kvm@...r.kernel.org>, X86 ML <x86@...nel.org>,
        Linux-MM <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
        jan.setjeeilers@...cle.com, Liran Alon <liran.alon@...cle.com>,
        Jonathan Adams <jwadams@...gle.com>
Subject: Re: [RFC KVM 18/27] kvm/isolation: function to copy page table
 entries for percpu buffer


On 5/14/19 5:23 PM, Andy Lutomirski wrote:
> On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre
> <alexandre.chartre@...cle.com> wrote:
>>
>>
>> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
>>>
>>>
>>>> On May 14, 2019, at 1:25 AM, Alexandre Chartre <alexandre.chartre@...cle.com> wrote:
>>>>
>>>>
>>>>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
>>>>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
>>>>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre
>>>>>> <alexandre.chartre@...cle.com> wrote:
>>>>>>>
>>>>>>> pcpu_base_addr is already mapped to the KVM address space, but this
>>>>>>> represents the first percpu chunk. To access a per-cpu buffer not
>>>>>>> allocated in the first chunk, add a function which maps all cpu
>>>>>>> buffers corresponding to that per-cpu buffer.
>>>>>>>
>>>>>>> Also add function to clear page table entries for a percpu buffer.
>>>>>>>
>>>>>>
>>>>>> This needs some kind of clarification so that readers can tell whether
>>>>>> you're trying to map all percpu memory or just map a specific
>>>>>> variable.  In either case, you're making a dubious assumption that
>>>>>> percpu memory contains no secrets.
>>>>> I'm thinking the per-cpu random pool is a secrit. IOW, it demonstrably
>>>>> does contain secrits, invalidating that premise.
>>>>
>>>> The current code unconditionally maps the entire first percpu chunk
>>>> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
>>>> mainly a simplification for the POC because a lot of core information
>>>> that we need, for example just to switch mm, are stored there (like
>>>> cpu_tlbstate, current_task...).
>>>
>>> I don’t think you should need any of this.
>>>
>>
>> At the moment, the current code does need it. Otherwise it can't switch from
>> kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
>> and then the page fault handler will fail accessing "current" before calling
>> the kvm page fault handler. So it will double fault or loop on page faults.
>> There are many different places where percpu variables are used, and I have
>> experienced many double fault/page fault loop because of that.
> 
> Now you're experiencing what working on the early PTI code was like :)
> 
> This is why I think you shouldn't touch current in any of this.
> 
>>
>>>>
>>>> If the entire first percpu chunk effectively has secret then we will
>>>> need to individually map only buffers we need. The kvm_copy_percpu_mapping()
>>>> function is added to copy mapping for a specified percpu buffer, so
>>>> this used to map percpu buffers which are not in the first percpu chunk.
>>>>
>>>> Also note that mapping is constrained by PTE (4K), so mapped buffers
>>>> (percpu or not) which do not fill a whole set of pages can leak adjacent
>>>> data store on the same pages.
>>>>
>>>>
>>>
>>> I would take a different approach: figure out what you need and put it in its
>>> own dedicated area, kind of like cpu_entry_area.
>>
>> That's certainly something we can do, like Julian proposed with "Process-local
>> memory allocations": https://lkml.org/lkml/2018/11/22/1240
>>
>> That's fine for buffers allocated from KVM, however, we will still need some
>> core kernel mappings so the thread can run and interrupts can be handled.
>>
>>> One nasty issue you’ll have is vmalloc: the kernel stack is in the
>>> vmap range, and, if you allow access to vmap memory at all, you’ll
>>> need some way to ensure that *unmap* gets propagated. I suspect the
>>> right choice is to see if you can avoid using the kernel stack at all
>>> in isolated mode.  Maybe you could run on the IRQ stack instead.
>>
>> I am currently just copying the task stack mapping into the KVM page table
>> (patch 23) when a vcpu is created:
>>
>>          err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>>
>> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
>> so I am making the assumption that the same task is used to create and free
>> a vcpu.
>>
> 
> vCPUs are bound to an mm but not a specific task, right?  So I think
> this is wrong in both directions.
> 

I know, that was yet another shortcut for the POC, I assume there's a 1:1
mapping between a vCPU and task, but I think that's fair with qemu.


> Suppose a vCPU is created, then the task exits, the stack mapping gets
> freed (the core code tries to avoid this, but it does happen), and a
> new stack gets allocated at the same VA with different physical pages.
> Now you're toast :)  On the flip side, wouldn't you crash if a vCPU is
> created and then run on a different thread?

Yes, that's why I have a safety net: before entering KVM isolation I always
check that the current task is mapped in the KVM address space, if not it
gets mapped.

> How important is the ability to enable IRQs while running with the KVM
> page tables?
> 

I can't say, I would need to check but we probably need IRQs at least for
some timers. Sounds like you would really prefer IRQs to be disabled.


alex.