Message-ID: <C35C04F5-DEC3-45B3-A049-ED433F34767D@amazon.de>
Date: Fri, 11 Oct 2024 12:56:06 +0000
From: "Mediouni, Mohamed" <mediou@...zon.de>
To: David Hildenbrand <david@...hat.com>
CC: "Mehanna, Fares" <faresx@...zon.de>, "akpm@...ux-foundation.org"
<akpm@...ux-foundation.org>, "ardb@...nel.org" <ardb@...nel.org>,
"arnd@...db.de" <arnd@...db.de>, "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"broonie@...nel.org" <broonie@...nel.org>, "catalin.marinas@....com"
<catalin.marinas@....com>, "james.morse@....com" <james.morse@....com>,
"javierm@...hat.com" <javierm@...hat.com>, "jean-philippe@...aro.org"
<jean-philippe@...aro.org>, "joey.gouly@....com" <joey.gouly@....com>,
"kristina.martsenko@....com" <kristina.martsenko@....com>,
"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "linux-mm@...ck.org" <linux-mm@...ck.org>,
"mark.rutland@....com" <mark.rutland@....com>, "maz@...nel.org"
<maz@...nel.org>, "nh-open-source@...zon.com" <nh-open-source@...zon.com>,
"oliver.upton@...ux.dev" <oliver.upton@...ux.dev>, "ptosi@...gle.com"
<ptosi@...gle.com>, "rdunlap@...radead.org" <rdunlap@...radead.org>, "Kagan,
Roman" <rkagan@...zon.de>, "rppt@...nel.org" <rppt@...nel.org>,
"shikemeng@...weicloud.com" <shikemeng@...weicloud.com>,
"suzuki.poulose@....com" <suzuki.poulose@....com>, "tabba@...gle.com"
<tabba@...gle.com>, "will@...nel.org" <will@...nel.org>,
"yuzenghui@...wei.com" <yuzenghui@...wei.com>
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use
it
> On 11. Oct 2024, at 14:36, Mediouni, Mohamed <mediou@...zon.de> wrote:
>
>
>
>> On 11. Oct 2024, at 14:04, David Hildenbrand <david@...hat.com> wrote:
>>
>> On 10.10.24 17:52, Fares Mehanna wrote:
>>>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>>>> current and future speculation-based cross-process attacks. We still believe
>>>>> this is a nice thing to have.
>>>>>
>>>>> However, in the time since that post the Linux mm has grown quite a few new
>>>>> goodies, so we'd like to explore possibilities to implement this functionality
>>>>> with less effort and churn by leveraging the now-available facilities.
>>>>>
>>>>> An RFC was posted a few months back [2] to show the proof of concept and a
>>>>> simple test driver.
>>>>>
>>>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>>>> by piggy-backing on memfd_secret(): using regular user addresses but pinning the
>>>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>>>> directly accessible from the kernel.
>>>>> In addition to that, we are submitting 5 patches that use the secret memory to
>>>>> hide the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>>>
>>>> I'm a bit lost on what exactly we want to achieve. The point where we
>>>> start flipping user/supervisor flags confuses me :)
>>>>
>>>> With secretmem, you'd get memory allocated that
>>>> (a) Is accessible by user space -- mapped into user space.
>>>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>>>> (c) GUP will fail, but copy_from / copy_to user will work.
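[For readers unfamiliar with secretmem, a sketch (mine, not from the
series) of how (a)-(c) look from user space. Depending on kernel
version/config, the syscall may require CONFIG_SECRETMEM and
secretmem.enable=1 on the kernel command line.]

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
	/* no glibc wrapper; __NR_memfd_secret comes from recent headers */
	int fd = syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return 1;

	if (ftruncate(fd, 4096))
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	strcpy(p, "secret");	/* (a) the user mapping works              */
				/* (b) there is no direct-map alias, and    */
				/* (c) GUP users (e.g. O_DIRECT into p) fail,
				 *     while copy_to/from_user() still work */
	munmap(p, 4096);
	close(fd);
	return 0;
}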
>>>>
>>>>
>>>> Another way, without secretmem, would be to consider these "secrets"
>>>> kernel allocations that can be mapped into user space using mmap() of a
>>>> special fd. That is, they wouldn't have their origin in secretmem, but
>>>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>>>> with vm_insert_pages(), manually removing them from the directmap.
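[Again a sketch rather than code from the series: roughly what such a
KVM/driver-owned mmap() could look like, using vm_insert_pages() plus
direct-map removal. Stashing the pages in file->private_data is made
up, and the TLB flush of the direct-map range is omitted.]

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/set_memory.h>

static int secret_fd_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct page **pages = file->private_data;	/* hypothetical stash */
	unsigned long nr = vma_pages(vma);
	unsigned long remaining = nr;
	unsigned long i;
	int ret;

	for (i = 0; i < nr; i++) {
		/* Remove the linear-map alias so the rest of the kernel
		 * cannot (even speculatively) read the page; a TLB flush
		 * of the direct-map range is still needed, omitted here. */
		ret = set_direct_map_invalid_noflush(pages[i]);
		if (ret)
			return ret;
	}

	/* on older kernels: vma->vm_flags |= ... */
	vm_flags_set(vma, VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
	return vm_insert_pages(vma, vma->vm_start, pages, &remaining);
}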
>>>>
>>>> But, I am not sure who is supposed to access what. Let's explore the
>>>> requirements. I assume we want:
>>>>
>>>> (a) Pages accessible by user space -- mapped into user space.
>>>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>>>> (c) GUP to fail (no direct map).
>>>> (d) copy_from / copy_to user to fail?
>>>>
>>>> And on top of that, some way to access these pages on demand from kernel
>>>> space? (temporary CPU-local mapping?)
>>>>
>>>> Or how would the kernel make use of these allocations?
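[One conceivable answer to the "on demand" question, sketched by me:
map the direct-map-less page only for the duration of the access.
vmap() below is global rather than CPU-local; a genuinely CPU-local
variant would need something like a dedicated per-CPU fixmap slot.]

#include <linux/mm.h>
#include <linux/vmalloc.h>

static void access_secret_page(struct page *page)
{
	/* Short-lived kernel mapping for a page with no direct-map alias. */
	void *va = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

	if (!va)
		return;

	/* ... read/write the secret through va ... */

	vunmap(va);
}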
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>> Hi David,
>>
>> Hi Fares!
>>
>>> Thanks for taking a look at the patches!
>>> We're trying to allocate kernel memory that is accessible to the kernel, but
>>> only while the context of the owning process is loaded.
>>> So this is kernel memory that is not needed to operate the kernel itself; it
>>> is there to store & process data on behalf of a process. The requirement for
>>> this memory is that it is never touched unless the process is scheduled on
>>> this core; any other access will crash the kernel.
>>> So this memory should be directly readable and writable only by the kernel,
>>> and only while the process context is loaded. The memory shouldn't be
>>> readable or writable by the owning process at all.
>>> This is basically done by removing those pages from the kernel linear mapping
>>> and attaching them only to the process mm_struct. So during a context switch
>>> the kernel loses access to the secret memory of the process being scheduled
>>> out and gains access to the secret memory of the incoming process.
>>> This generally protects against speculation attacks, and against cases where
>>> another process manages to trick the kernel into leaking data from memory:
>>> the kernel will crash if it tries to access another process's secret memory.
>>> Since this memory is special in the sense that it is kernel memory but only
>>> makes sense in the context of the owning process, I tried in this patch series
>>> to explore the possibility of reusing memfd_secret() to allocate this memory
>>> in the user virtual address space, manage it in a VMA, and flip the permissions
>>> while keeping control of the mapping exclusively with the kernel.
>>> Right now it is:
>>> (a) Pages not accessible by user space -- even though they are mapped into user
>>> space, the PTEs are marked for kernel usage.
>>
>> Ah, that is the detail I was missing, now I see what you are trying to achieve, thanks!
>>
>> It is a bit architecture specific, because ... imagine architectures that have separate kernel+user space page table hierarchies, and not a simple PTE flag to change access permissions between kernel/user space.
>>
>> IIRC s390 is one such architecture that uses separate page tables for the user-space + kernel-space portions.
>>
>>> (b) Pages accessible by kernel space -- even though they are not mapped into the
>>> direct map, the PTEs in uvaddr are marked for kernel usage.
>>> (c) copy_from / copy_to user won't fail -- because the address is in the user
>>> range, but this can be fixed by reserving a specific range of user vaddr for
>>> this feature and checking against that range there (sketched below).
>>> (d) The secret memory vaddr is guessable by the owner process -- that can also
>>> be fixed by carving out a bigger chunk of user vaddr for this feature and
>>> randomly placing the secret memory within it.
>>> (e) The mapping is off-limits to the owner process because the VMA is marked
>>> as locked, sealed and special.
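[Sketch of the fix hinted at in (c)/(d), with invented names and an
arbitrary window: reserve a dedicated slice of the user address space
for mm-local secrets and reject it in the uaccess path, so
copy_from_user()/copy_to_user() can no longer reach it even though the
addresses are formally user addresses.]

#include <linux/types.h>
#include <linux/uaccess.h>

/* Hypothetical, randomly placed window reserved for mm-local secrets. */
#define SECRET_AREA_START	0x7f0000000000UL
#define SECRET_AREA_END		0x7f8000000000UL

static inline bool uaccess_hits_secret_area(const void __user *ptr, size_t len)
{
	unsigned long start = (unsigned long)ptr;

	return start < SECRET_AREA_END && start + len > SECRET_AREA_START;
}

/* This check would be folded into access_ok()/the arch uaccess helpers so
 * that any usercopy overlapping the window fails with -EFAULT. */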
>>
>> Okay, so in this RFC you are jumping through quite some hoops to have a kernel allocation unmapped from the direct map but mapped into a per-process page table only accessible by kernel space. :)
>>
>> So you really don't want this mapped into user space at all (consequently, no GUP, no access, no copy_from_user ...). In this RFC it's mapped but turned inaccessible by flipping the "kernel vs. user" switch.
>>
>>> The other alternative (the one implemented in the first submission) is to track
>>> those allocations in a non-shared kernel PGD per process, and then handle
>>> creating, forking and context-switching this PGD.
>>
>> That sounds like a better approach. So we would remove the pages from the shared kernel direct map and map them into a separate kernel-portion in the per-MM page tables?
>>
>> Can you envision that would also work with architectures like s390x? I assume we would not only need the per-MM user space page table hierarchy, but also a per-MM kernel space page table hierarchy, into which we also map the common/shared-among-all-processes kernel space page tables (e.g., directmap).
> Yes, that’s also applicable to arm64. There’s currently no separate per-mm user space page hierarchy there.
(Typo in my line above: "user space" should read "kernel" -- there is
currently no separate per-mm kernel page hierarchy on arm64.)
Thanks,
-Mohamed
>>> What I like about the memfd_secret() approach is its simplicity and the fact
>>> that it is arch-agnostic; what I don't like is the increased attack surface
>>> from using VMAs to track those allocations.
>>
>> Yes, but memfd_secret() was really designed for user space to hold secrets. But I can see how you came to this solution.
>>
>>> I'm thinking of working on a PoC to implement the first approach of using a
>>> non-shared kernel PGD for secret memory allocations on arm64. This includes
>>> adding a kernel page table per process where all PGD entries are shared except
>>> one, which will be used for mapping the secret allocations, and handling fork
>>> & context switching (TTBR1 switching(?)) correctly for the secret memory PGD.
>>> What do you think? I'd really appreciate opinions and possible ways forward.
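[To make that idea a bit more concrete, a very rough sketch with
invented names and no arch detail: give each mm its own copy of the
top-level kernel table, share every entry with swapper_pg_dir except
one slot reserved for that mm's secret mappings, and have the
context-switch code point TTBR1 at the per-mm copy.]

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/string.h>

#define SECRET_PGD_INDEX	(PTRS_PER_PGD - 2)	/* hypothetical slot */

static pgd_t *alloc_secret_kernel_pgd(void)
{
	pgd_t *kpgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);

	if (!kpgd)
		return NULL;

	/* Share everything the kernel already maps (directmap, vmalloc,
	 * modules, ...) by copying the init top level, then keep one
	 * slot private for this mm's secret allocations. */
	memcpy(kpgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
	kpgd[SECRET_PGD_INDEX] = __pgd(0);

	/* The caller would stash this in a new (hypothetical) mm_struct
	 * field; fork allocates a fresh copy, exit frees it, and the
	 * context-switch path repoints TTBR1 (normally left at
	 * swapper_pg_dir) at the per-mm table. */
	return kpgd;
}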
>>
>> Naive question: does arm64 rather resemble the s390x model or the x86-64 model?
> arm64 has separate page tables for kernel and user-mode. Except for the KPTI case, the kernel page tables aren’t swapped per-process and stay the same all the time.
>
> Thanks,
> -Mohamed
>> --
>> Cheers,
>>
>> David / dhildenb
>>
>
Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597