Date: Tue, 20 Feb 2024 13:05:42 +0100
From: David Hildenbrand <david@...hat.com>
To: Alexandru Elisei <alexandru.elisei@....com>, catalin.marinas@....com,
 will@...nel.org, oliver.upton@...ux.dev, maz@...nel.org,
 james.morse@....com, suzuki.poulose@....com, yuzenghui@...wei.com,
 pcc@...gle.com, steven.price@....com, anshuman.khandual@....com,
 eugenis@...gle.com, kcc@...gle.com, hyesoo.yu@...sung.com, rppt@...nel.org,
 akpm@...ux-foundation.org, peterz@...radead.org, konrad.wilk@...cle.com,
 willy@...radead.org, jgross@...e.com, hch@....de, geert@...ux-m68k.org,
 vitaly.wool@...sulko.com, ddstreet@...e.org, sjenning@...hat.com,
 hughd@...gle.com, linux-arm-kernel@...ts.infradead.org,
 linux-kernel@...r.kernel.org, linux-arch@...r.kernel.org, linux-mm@...ck.org
Subject: Re: arm64 MTE tag storage reuse - alternatives to MIGRATE_CMA

On 20.02.24 12:26, Alexandru Elisei wrote:
> Hello,
> 

Hi!

> This is a request to discuss alternatives to the current approach for
> reusing the MTE tag storage memory for data allocations [1]. Each iteration
> of the series uncovered new issues, the latest being that memory allocation
> is being performed in atomic contexts [2]; I would like to start a
> discussion regarding possible alternatives that would integrate better
> with the memory management code.
> 
> This is a high level overview of the current approach:
> 
>   * Tag storage pages are put on the MIGRATE_CMA lists, meaning they can be
>     used for data allocations like (almost) any other page in the system.
> 
>   * When a page is allocated as tagged, the corresponding tag storage is
>     also allocated.
> 
>   * There's a static relationship between a page and the location in memory
>     where its tags are stored. Because of this, if the corresponding tag
>     storage is used for data, the tag storage page is migrated.
> 
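To make the static relationship above concrete: MTE keeps a 4-bit tag for
every 16 bytes of data, so the tag storage covering one page is
PAGE_SIZE / 32 bytes, and its location follows directly from the data
page's PFN. A minimal sketch of such a mapping, with a made-up layout
(the real layout is described by firmware, and one tag storage page
covers 32 data pages):

#include <linux/mm.h>

/* Made-up, fixed layout; real systems describe the regions via firmware. */
#define DATA_BASE_PFN           0x800000UL
#define TAG_STORAGE_BASE_PFN    0x880000UL

/* MTE: 4-bit tag per 16-byte granule -> PAGE_SIZE / 32 bytes of tags per page. */
#define MTE_TAG_BYTES_PER_PAGE  (PAGE_SIZE / 32)

/* PFN of the tag storage page that holds the tags for @data_pfn. */
static unsigned long mte_tag_storage_pfn(unsigned long data_pfn)
{
        unsigned long offset = (data_pfn - DATA_BASE_PFN) * MTE_TAG_BYTES_PER_PAGE;

        return TAG_STORAGE_BASE_PFN + (offset >> PAGE_SHIFT);
}
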
> Although this is the most generic approach because tag storage pages are
> treated like normal pages, it has some disadvantages:
> 
>   * HW KASAN (MTE in the kernel) cannot be used. The kernel allocates memory
>     in atomic context, where migration is not possible.
> 
>   * Tag storage pages cannot be themselves tagged, and this means that all
>     CMA pages, even those which aren't tag storage, cannot be used for
>     tagged allocations.
> 
>   * Page migration is costly, and a process that uses MTE can experience
>     measurable slowdowns if the tag storage it requires is in use for data.
>     There might be ways to reduce this cost (by reducing the likelihood that
>     tag storage pages are allocated), but it cannot be completely
>     eliminated.
> 
>   * Worse yet, a userspace process can use a tag storage page in such a way
>     that migration is effectively impossible [3],[4].  A malicious process
>     can make use of this to prevent the allocation of tag storage for other
>     processes in the system, leading to a degraded experience for the
>     affected processes. Worst case scenario, progress becomes impossible for
>     those processes.
> 
> One alternative approach I'm looking at right now is cleancache. Cleancache
> was removed in v5.17 (commit 0a4ee518185e) because the only backend, the
> tmem driver, had been removed earlier (in v5.3, commit 814bbf49dcd0).
> 
> With this approach, MTE tag storage would be implemented as a driver
> backend for cleancache. When a tag storage page is needed for storing tags,
> the page would simply be dropped from the cache (cleancache_get_page()
> returns -1).

With large folios in place, we'd likely want to investigate not working 
on individual pages, but on (possibly large) folios instead.
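
For context, a cleancache backend was just a set of callbacks registered
with the core; a skeletal tag storage backend might have looked roughly
like this (based on the API as it existed before the v5.17 removal, with
the remaining callbacks omitted and the mte_tag_*() helpers entirely
made up):

#include <linux/cleancache.h>   /* the pre-0a4ee518185e API */

/* Miss (-1) when the backing tag storage page was reclaimed for tags. */
static int tag_storage_get_page(int pool, struct cleancache_filekey key,
                                pgoff_t index, struct page *page)
{
        return mte_tag_load(pool, key, index, page) ? 0 : -1;
}

/* Only cache into tag storage pages that do not currently hold tags. */
static void tag_storage_put_page(int pool, struct cleancache_filekey key,
                                 pgoff_t index, struct page *page)
{
        mte_tag_store(pool, key, index, page);
}

static void tag_storage_invalidate_page(int pool, struct cleancache_filekey key,
                                        pgoff_t index)
{
        mte_tag_drop(pool, key, index);
}

static const struct cleancache_ops tag_storage_cleancache_ops = {
        .get_page         = tag_storage_get_page,
        .put_page         = tag_storage_put_page,
        .invalidate_page  = tag_storage_invalidate_page,
        /* init_fs, init_shared_fs, invalidate_inode, invalidate_fs omitted */
};

/* cleancache_register_ops(&tag_storage_cleancache_ops); */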

> 
> I believe this is a very good fit for tag storage reuse, because it allows
> tag storage to be allocated even in atomic contexts, which enables MTE in
> the kernel. As a bonus, all of the changes to MM from the current approach
> wouldn't be needed, as tag storage allocation can be handled entirely in
> set_ptes_at(), copy_*highpage() or arch_swap_restore().
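
As an illustration of how small the arch-side hook could be with a
cleancache-style backend, a rough sketch (mte_prepare_tag_storage() and
mte_tag_storage_invalidate() are made-up names, not from the series):

#include <linux/pgtable.h>      /* arm64 context assumed */

/*
 * Hypothetical helper called when a tagged mapping is installed.
 * Dropping a clean page from the backend never sleeps, so this would be
 * safe in atomic context, which is what makes HW KASAN feasible here.
 */
static inline void mte_prepare_tag_storage(pte_t pte, unsigned int nr)
{
        if (system_supports_mte() && pte_present(pte) && pte_tagged(pte))
                mte_tag_storage_invalidate(pte_pfn(pte), nr);
}
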
> 
> Is this a viable approach that would be upstreamable? Are there other
> solutions that I haven't considered? I'm very much open to any alternatives
> that would make tag storage reuse viable.

As raised recently, I had similar ideas with something like virtio-mem 
in the past (wanted to call it virtio-tmem back then), but didn't have 
time to look into it yet.

I considered both using special device memory as a "cleancache" backend 
and using it as backend storage for something similar to zswap. We would 
not need a memmap/"struct page" for that special device memory, which 
reduces memory overhead and makes "adding more memory" a more reliable 
operation.

Using it as a "cleancache" backend does make some things a lot easier.

The idea would be to provide a VM with a variable amount of additional 
memory that can be reclaimed easily and reliably on demand.

The details are a bit more involved, but in essence, imagine a special 
physical memory region that is provided by the hypervisor via a device 
to the VM. A virtio device "owns" that region and the driver manages it, 
based on requests from the hypervisor.

Similar to virtio-mem, there are ways for the hypervisor to request 
changes to the memory consumption of a device (setting the requested 
size). So when requested to consume less, clean pagecache pages can be 
dropped and the memory can be handed back to the hypervisor.
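
To make that handshake concrete, such a device could expose a
virtio-mem-style config space; the layout below is purely illustrative
(no such device type exists today):

#include <linux/types.h>

/* Illustrative only, loosely modeled on virtio-mem's config space. */
struct virtio_tmem_config {
        __le64 region_addr;     /* guest-physical base of the special region */
        __le64 region_size;     /* fixed size of that region */
        __le64 requested_size;  /* how much the hypervisor wants the guest to use */
        __le64 used_size;       /* how much the driver currently uses */
};

/*
 * On a config-change notification the driver compares used_size against
 * requested_size; when asked to shrink, it drops enough clean pagecache
 * pages from the region and hands the memory back to the hypervisor.
 */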

Of course, we would likely want to consider using "slower" memory in the 
hypervisor to back such a device.

I also thought about better integrating memory reclaim in the 
hypervisor, in a way similar to "MADV_FREE" semantics. One idea I had 
was that the memory provided by the device might have "special" 
semantics (as if the memory were always marked MADV_FREE), whereby the 
hypervisor could reclaim+discard any memory in that region at any time, 
and the driver would have ways to get notified about that, or to detect 
that reclaim happened.
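
One way the driver could detect that a discard happened, assuming
discarded pages read back as zeroes (an assumption, not a defined
semantic): never trust the region's contents and stamp every cached page
with a marker that a discard would destroy. A rough sketch:

#include <linux/crc32.h>
#include <linux/types.h>

/* Hypothetical per-page header the driver writes in front of cached data. */
struct tmem_page_hdr {
        u64 magic;      /* arbitrary non-zero constant */
        u64 gen;        /* generation counter bumped on every store */
        u32 csum;       /* checksum of the cached data */
};

#define TMEM_MAGIC      0x746d656d70616765ULL

/* True if the cached copy is intact, false if the hypervisor discarded it. */
static bool tmem_page_valid(const struct tmem_page_hdr *hdr,
                            const void *data, size_t len, u64 expected_gen)
{
        if (hdr->magic != TMEM_MAGIC || hdr->gen != expected_gen)
                return false;   /* discarded pages read back as zeroes */
        return hdr->csum == crc32(0, data, len);
}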

I learned that there are cases where data that is significantly larger 
than main memory might be read repeatedly. As long as there is free 
memory in the hypervisor, it could be used as a cache for clean 
pagecache pages. In contrast to memory ballooning + virtio-mem, that 
memory can be easily and reliably reclaimed. And reclaiming that memory 
cannot really hurt the VM; it would only affect performance.

Long story short: what I had in mind would require similar hooks (again).

In contrast to tmem, with arm64 MTE we could get an actual supported 
cleancache backend fairly easily. I recall that tmem was abandoned in 
Xen and never really reached production quality.

-- 
Cheers,

David / dhildenb

