linux-kernel - Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking

Message-ID: <CABdmKX2LhrcyDM0r1tytt2vKLuCLGsxZaGHgN+u1hUmEMXuGtw@mail.gmail.com>
Date: Fri, 4 Apr 2025 18:57:25 -0700
From: "T.J. Mercier" <tjmercier@...gle.com>
To: Christian König <christian.koenig@....com>
Cc: Maxime Ripard <mripard@...nel.org>, Dave Airlie <airlied@...il.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Marek Szyprowski <m.szyprowski@...sung.com>, 
	Robin Murphy <robin.murphy@....com>, Sumit Semwal <sumit.semwal@...aro.org>, 
	Benjamin Gaignard <benjamin.gaignard@...labora.com>, Brian Starkey <Brian.Starkey@....com>, 
	John Stultz <jstultz@...gle.com>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Thomas Zimmermann <tzimmermann@...e.de>, Simona Vetter <simona@...ll.ch>, Tomasz Figa <tfiga@...omium.org>, 
	Mauro Carvalho Chehab <mchehab@...nel.org>, Ben Woodard <woodard@...hat.com>, 
	Hans Verkuil <hverkuil@...all.nl>, 
	Laurent Pinchart <laurent.pinchart+renesas@...asonboard.com>, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, iommu@...ts.linux.dev, 
	linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org, 
	linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH RFC 00/12] dma: Enable dmem cgroup tracking

On Fri, Apr 4, 2025 at 1:47 AM Christian König <christian.koenig@....com> wrote:
>
> Hi Maxime,
>
> Am 03.04.25 um 17:47 schrieb Maxime Ripard:
> > On Thu, Apr 03, 2025 at 09:39:52AM +0200, Christian König wrote:
> >>> For the UMA GPU case where there is no device memory or eviction
> >>> problem, perhaps a configurable option to just say account memory in
> >>> memcg for all allocations done by this process, and state yes you can
> >>> work around it with allocation servers or whatever but the behaviour
> >>> for well behaved things is at least somewhat defined.
> >> We can have that as a workaround, but I think we should approach that
> >> differently.
> >>
> >> With upcoming CXL even coherent device memory is exposed to the core
> >> OS as NUMA memory with just a high latency.
> >>
> >> So both in the CXL and UMA case it actually doesn't make sense to
> >> allocate the memory through the driver interfaces any more. With
> >> AMDGPU for example we are just replicating mbind()/madvise() within
> >> the driver.
> >>
> >> Instead what the DRM subsystem should aim for is to allocate memory
> >> using the normal core OS functionality and then import it into the
> >> driver.
> >>
> >> AMD, NVidia and Intel have had HMM working for quite a while now but it
> >> has some limitations, especially on the performance side.
> >>
> >> So for AMDGPU we are currently evaluating udmabuf as alternative. That
> >> seems to be working fine with different NUMA nodes, is perfectly memcg
> >> accounted and gives you a DMA-buf which can be imported everywhere.
> >>
> >> The only show stopper might be the allocation performance, but even if
> >> that's the case I think the ongoing folio work will properly resolve
> >> that.
> > I mean, no, the showstopper to that is that using udmabuf has the
> > assumption that you have an IOMMU for every device doing DMA, which is
> > absolutely not true on !x86 platforms.
> >
> > It might be true for all GPUs, but it certainly isn't for display
> > controllers, nor is it for codecs, ISPs, and cameras.
> >
> > And then there's the other assumption that all memory is under the
> > memory allocator control, which isn't the case on most recent platforms
> > either.
> >
> > We *need* to take CMA into account there, all the carved-out, device
> > specific memory regions, and the memory regions that aren't even under
> > Linux supervision like protected memory that is typically handled by the
> > firmware and all you get is a dma-buf.
> >
> > Saying that it's how you want to work around it on AMD is absolutely
> > fine, but DRM as a whole should certainly not aim for that, because it
> > can't.
>
> A bunch of good points you bring up here but it sounds like you misunderstood me a bit.
>
> I'm certainly *not* saying that we should push for udmabuf for everything; that is clearly use-case specific.
>
> For use cases like CMA or protected carve-out the question of what to do doesn't even arise in the first place.
>
> When you have CMA which dynamically steals memory from the core OS then of course it should be accounted to memcg.
>
> When you have carve-out which the core OS memory management doesn't even know about then it should certainly be handled by dmem.
>
> The problematic use cases are the ones where a buffer can sometimes be backed by system memory and sometimes by something special. For this we don't have a good approach, since every approach seems to have a drawback for some use case.

This reminds me of memory.memsw in cgroup v1, where both resident and
swapped memory show up under the same memcg counter. In this dmem
scenario it's similar but across two different cgroup controllers
instead of two different types of system memory under the same
controller.

memsw doesn't exist in v2, and users are asking for it back. [1] I
tend to agree that a combined counter is useful as I don't see a great
way to apply meaningful limits to individual counters (or individual
controller limits in the dmem+memcg case) when multiple cgroups are
involved and eviction can cause memory to be transferred from one
place to another. Sorry I'm not really offering a solution to this,
but I feel like only transferring the charge between cgroups is a
partial solution since the enforcement by the kernel is independent
for each controller. So yeah, as Dave and Sima said, for accounting I
guess it works, and maybe that's good enough if you have userspace
enforcement that's smart enough to look in all the different places.
But then there are the folks asking for kernel enforcement. Maybe just
accounting as best we can is a good place to start?
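
To make the "userspace looks in all the different places" idea concrete,
here is a minimal sketch of a combined memcg+dmem counter computed in
userspace. It assumes memory.current holds a single byte count and
dmem.current holds one "<region> <bytes>" line per region; the helper
names, the sample region name and the budget are hypothetical. It only
accounts, it does not enforce anything in the kernel.

```python
def parse_memory_current(text):
    """Parse a single-value cgroup file such as memory.current."""
    return int(text.strip())


def parse_dmem_current(text):
    """Parse dmem.current, assumed to be one '<region> <bytes>' line per region."""
    usage = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last whitespace so region names may contain '/' or ':'.
        region, value = line.rsplit(None, 1)
        usage[region] = int(value)
    return usage


def combined_usage(memory_current_text, dmem_current_text):
    """Sum the memcg charge and all dmem region charges (a memsw-like total)."""
    dmem_total = sum(parse_dmem_current(dmem_current_text).values())
    return parse_memory_current(memory_current_text) + dmem_total


def over_budget(memory_current_text, dmem_current_text, budget_bytes):
    """Hypothetical userspace policy check against one combined budget."""
    return combined_usage(memory_current_text, dmem_current_text) > budget_bytes


if __name__ == "__main__":
    # Sample contents; a real tool would read these files from the
    # cgroup's directory under the cgroup v2 mount point.
    mem = "4096\n"
    dmem = "drm/0000:03:00.0/vram0 8192\n"
    print(combined_usage(mem, dmem), over_budget(mem, dmem, 10000))
```

A buffer that migrates between system memory and a dmem region moves
from one term of the sum to the other, so the combined total stays
stable even though each per-controller counter changes.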

[1] https://lore.kernel.org/all/20250319064148.774406-5-jingxiangzeng.cas@gmail.com/