[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5f9ba14a-909b-4b49-b1de-3dc98b31aee0@redhat.com>
Date: Fri, 18 Oct 2024 20:52:43 +0200
From: David Hildenbrand <david@...hat.com>
To: Fares Mehanna <faresx@...zon.de>
Cc: akpm@...ux-foundation.org, ardb@...nel.org, arnd@...db.de,
bhelgaas@...gle.com, broonie@...nel.org, catalin.marinas@....com,
james.morse@....com, javierm@...hat.com, jean-philippe@...aro.org,
joey.gouly@....com, kristina.martsenko@....com, kvmarm@...ts.linux.dev,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, mark.rutland@....com, maz@...nel.org, mediou@...zon.de,
nh-open-source@...zon.com, oliver.upton@...ux.dev, ptosi@...gle.com,
rdunlap@...radead.org, rkagan@...zon.de, rppt@...nel.org,
shikemeng@...weicloud.com, suzuki.poulose@....com, tabba@...gle.com,
will@...nel.org, yuzenghui@...wei.com
Subject: Re: [RFC PATCH 0/7] support for mm-local memory allocations and use
it
On 11.10.24 16:25, Fares Mehanna wrote:
>>>
>>>
>>>> On 11. Oct 2024, at 14:36, Mediouni, Mohamed <mediou@...zon.de> wrote:
>>>>
>>>>
>>>>
>>>>> On 11. Oct 2024, at 14:04, David Hildenbrand <david@...hat.com> wrote:
>>>>>
>>>>> On 10.10.24 17:52, Fares Mehanna wrote:
>>>>>>>> In a series posted a few years ago [1], a proposal was put forward to allow the
>>>>>>>> kernel to allocate memory local to a mm and thus push it out of reach for
>>>>>>>> current and future speculation-based cross-process attacks. We still believe
>>>>>>>> this is a nice thing to have.
>>>>>>>>
>>>>>>>> However, in the time passed since that post Linux mm has grown quite a few new
>>>>>>>> goodies, so we'd like to explore possibilities to implement this functionality
>>>>>>>> with less effort and churn leveraging the now available facilities.
>>>>>>>>
>>>>>>>> An RFC was posted few months back [2] to show the proof of concept and a simple
>>>>>>>> test driver.
>>>>>>>>
>>>>>>>> In this RFC, we're using the same approach of implementing mm-local allocations
>>>>>>>> piggy-backing on memfd_secret(), using regular user addresses but pinning the
>>>>>>>> pages and flipping the user/supervisor flag on the respective PTEs to make them
>>>>>>>> directly accessible from kernel.
>>>>>>>> In addition to that we are submitting 5 patches to use the secret memory to hide
>>>>>>>> the vCPU gp-regs and fp-regs on arm64 VHE systems.
>>>>>>>
>>>>>>> I'm a bit lost on what exactly we want to achieve. The point where we
>>>>>>> start flipping user/supervisor flags confuses me :)
>>>>>>>
>>>>>>> With secretmem, you'd get memory allocated that
>>>>>>> (a) Is accessible by user space -- mapped into user space.
>>>>>>> (b) Is inaccessible by kernel space -- not mapped into the direct map
>>>>>>> (c) GUP will fail, but copy_from / copy_to user will work.
>>>>>>>
>>>>>>>
>>>>>>> Another way, without secretmem, would be to consider these "secrets"
>>>>>>> kernel allocations that can be mapped into user space using mmap() of a
>>>>>>> special fd. That is, they wouldn't have their origin in secretmem, but
>>>>>>> in KVM as a kernel allocation. It could be achieved by using VM_MIXEDMAP
>>>>>>> with vm_insert_pages(), manually removing them from the directmap.
>>>>>>>
>>>>>>> But, I am not sure who is supposed to access what. Let's explore the
>>>>>>> requirements. I assume we want:
>>>>>>>
>>>>>>> (a) Pages accessible by user space -- mapped into user space.
>>>>>>> (b) Pages inaccessible by kernel space -- not mapped into the direct map
>>>>>>> (c) GUP to fail (no direct map).
>>>>>>> (d) copy_from / copy_to user to fail?
>>>>>>>
>>>>>>> And on top of that, some way to access these pages on demand from kernel
>>>>>>> space? (temporary CPU-local mapping?)
>>>>>>>
>>>>>>> Or how would the kernel make use of these allocations?
>>>>>>>
>>>>>>> --
>>>>>>> Cheers,
>>>>>>>
>>>>>>> David / dhildenb
>>>>>> Hi David,
>>>>>
>>>>> Hi Fares!
>>>>>
>>>>>> Thanks for taking a look at the patches!
>>>>>> We're trying to allocate a kernel memory that is accessible to the kernel but
>>>>>> only when the context of the process is loaded.
>>>>>> So this is a kernel memory that is not needed to operate the kernel itself, it
>>>>>> is to store & process data on behalf of a process. The requirement for this
>>>>>> memory is that it would never be touched unless the process is scheduled on this
>>>>>> core. otherwise any other access will crash the kernel.
>>>>>> So this memory should only be directly readable and writable by the kernel, but
>>>>>> only when the process context is loaded. The memory shouldn't be readable or
>>>>>> writable by the owner process at all.
>>>>>> This is basically done by removing those pages from kernel linear address and
>>>>>> attaching them only in the process mm_struct. So during context switching the
>>>>>> kernel loses access to the secret memory scheduled out and gain access to the
>>>>>> new process secret memory.
>>>>>> This generally protects against speculation attacks, and if other process managed
>>>>>> to trick the kernel to leak data from memory. In this case the kernel will crash
>>>>>> if it tries to access other processes secret memory.
>>>>>> Since this memory is special in the sense that it is kernel memory but only make
>>>>>> sense in the term of the owner process, I tried in this patch series to explore
>>>>>> the possibility of reusing memfd_secret() to allocate this memory in user virtual
>>>>>> address space, manage it in a VMA, flipping the permissions while keeping the
>>>>>> control of the mapping exclusively with the kernel.
>>>>>> Right now it is:
>>>>>> (a) Pages not accessible by user space -- even though they are mapped into user
>>>>>> space, the PTEs are marked for kernel usage.
>>>>>
>>>>> Ah, that is the detail I was missing, now I see what you are trying to achieve, thanks!
>>>>>
>>>>> It is a bit architecture specific, because ... imagine architectures that have separate kernel+user space page table hierarchies, and not a simple PTE flag
>> to change access permissions between kernel/user space.
>>>>>
>>>>> IIRC s390 is one such architecture that uses separate page tables for the user-space + kernel-space portions.
>>>>>
>>>>>> (b) Pages accessible by kernel space -- even though they are not mapped into the
>>>>>> direct map, the PTEs in uvaddr are marked for kernel usage.
>>>>>> (c) copy_from / copy_to user won't fail -- because it is in the user range, but
>>>>>> this can be fixed by allocating specific range in user vaddr to this feature
>>>>>> and check against this range there.
>>>>>> (d) The secret memory vaddr is guessable by the owner process -- that can also
>>>>>> be fixed by allocating bigger chunk of user vaddr for this feature and
>>>>>> randomly placing the secret memory there.
>>>>>> (e) Mapping is off-limits to the owner process by marking the VMA as locked,
>>>>>> sealed and special.
>>>>>
>>>>> Okay, so in this RFC you are jumping through quite some hoops to have a kernel allocation unmapped from the direct map but mapped into a per-process page
>> table only accessible by kernel space. :)
>>>>>
>>>>> So you really don't want this mapped into user space at all (consequently, no GUP, no access, no copy_from_user ...). In this RFC it's mapped but turned
>> inaccessible by flipping the "kernel vs. user" switch.
>>>>>
>>>>>> Other alternative (that was implemented in the first submission) is to track those
>>>>>> allocations in a non-shared kernel PGD per process, then handle creating, forking
>>>>>> and context-switching this PGD.
>>>>>
>>>>> That sounds like a better approach. So we would remove the pages from the shared kernel direct map and map them into a separate kernel-portion in the per-MM
>> page tables?
>>>>>
>>>>> Can you envision that would also work with architectures like s390x? I assume we would not only need the per-MM user space page table hierarchy, but also a
>> per-MM kernel space page table hierarchy, into which we also map the common/shared-among-all-processes kernel space page tables (e.g., directmap).
>>>> Yes, that’s also applicable to arm64. There’s currently no separate per-mm user space page hierarchy there.
>>> typo, read kernel
>>
>>
>> Okay, thanks. So going into that direction makes more sense.
>>
>> I do wonder if we really have to deal with fork() ... if the primary
>> users don't really have meaning in the forked child (e.g., just like
>> fork() with KVM IIRC) we might just get away by "losing" these
>> allocations in the child process.
>>
>> Happy to learn why fork() must be supported.
>
> It really depends on the use cases of the kernel secret allocation, but in my
> mind a troubling scenario:
> 1. Process A had a resource X.
> 2. Kernel decided to keep some data related to resource X in process A secret
> memory.
> 3. Process A decided to fork, now process B share the resource X.
> 4. Process B started using resource X. <-- This will crash the kernel as the
> used kernel page table on process B has no mapping for the secret memory used
> in resource X.
>
> I haven't tried to trigger this crash myself though.
>
Right, and if we can rule out any users that are supposed to work after
fork(), we can just disregard that in the first version.
I never played with this, but let's assume you make use of these
mm-local allocations in KVM context.
What would happens if you fork() with a KVM fd and try accessing that fd
from the other process using ioctls? I recall that KVM will not be
"duplicated".
What would happen if you send that fd over to a completely different
process and try accessing that fd from the other process using ioctls?
Of course, question being: if you have MM-local allocations in both
cases and there is suddenly a different MM ... assuming that both cases
are even possible (if they are not possible, great! :) ).
I think I am supposed to know if these things are possible or not and
what would happen, but it's late Friday and my brain is begging for some
Weekend :D
> I didn't think in depth about this issue yet, but I need to because duplicating
> the secret memory mappings in the new forked process is easy (To give kernel
> access on the secret memory), but tearing them down across all forked processes
> is a bit complicated (To clean stale mappings on parent/child processes). Right
> now tearing down the mapping will only happen on mm_struct which allocated the
> secret memory.
If an allocation is MM-local, I would assume that fork() would
*duplicate* that allocation (leaving CoW out of the picture :D ), but
that's where the fun begins (see above regarding my confusion about KVM
and fork() behavior ... ).
--
Cheers,
David / dhildenb
Powered by blists - more mailing lists