lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CABdmKX2TSjcgQws6zBCWDkFH9S6f1m6aOHP+-n=WjHgKvLH-mw@mail.gmail.com>
Date: Tue, 15 Apr 2025 11:13:31 -0700
From: "T.J. Mercier" <tjmercier@...gle.com>
To: Christian König <christian.koenig@....com>
Cc: sumit.semwal@...aro.org, ast@...nel.org, daniel@...earbox.net, 
	andrii@...nel.org, martin.lau@...ux.dev, skhan@...uxfoundation.org, 
	linux-kernel@...r.kernel.org, linux-media@...r.kernel.org, 
	dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org, 
	linux-doc@...r.kernel.org, bpf@...r.kernel.org, 
	linux-kselftest@...r.kernel.org, android-mm@...gle.com, simona@...ll.ch, 
	corbet@....net, eddyz87@...il.com, song@...nel.org, yonghong.song@...ux.dev, 
	john.fastabend@...il.com, kpsingh@...nel.org, sdf@...ichev.me, 
	jolsa@...nel.org, mykolal@...com
Subject: Re: [PATCH 0/4] Replace CONFIG_DMABUF_SYSFS_STATS with BPF

On Tue, Apr 15, 2025 at 2:03 AM Christian König
<christian.koenig@....com> wrote:
>
> Am 15.04.25 um 00:52 schrieb T.J. Mercier:
> > Until CONFIG_DMABUF_SYSFS_STATS was added [1] it was only possible to
> > perform per-buffer accounting with debugfs which is not suitable for
> > production environments. Eventually we discovered the overhead with
> > per-buffer sysfs file creation/removal was significantly impacting
> > allocation and free times, and exacerbated kernfs lock contention. [2]
> > dma_buf_stats_setup() is responsible for 39% of single-page buffer
> > creation duration, or 74% of single-page dma_buf_export() duration when
> > stressing dmabuf allocations and frees.
> >
> > I prototyped a change from per-buffer to per-exporter statistics with a
> > RCU protected list of exporter allocations that accommodates most (but
> > not all) of our use-cases and avoids almost all of the sysfs overhead.
> > While that adds less overhead than per-buffer sysfs, and less even than
> > the maintenance of the dmabuf debugfs_list, it's still *additional*
> > overhead on top of the debugfs_list and doesn't give us per-buffer info.
> >
> > This series uses the existing dmabuf debugfs_list to implement a BPF
> > dmabuf iterator, which adds no overhead to buffer allocation/free and
> > provides per-buffer info.
>
> Really interesting suggestion. I was expecting something like cgroups, but bpf is certainly an option as well.
>
> How do you then use bpf to account the buffers? E.g. are you interacting with cgroups or have sysfs procedure to expose the list or how does that work?

Where currently we read through all of /sys/kernel/dmabuf/buffers/,
with this we can load or pin a bpf program (like
tools/testing/selftests/bpf/progs/dmabuf_iter.c) and then just cat
(and parse) /sys/fs/bpf/dmabufs to get all per-buffer info one go.

The attribution of buffers to processes is currently done by looking
through procfs for fd and map references to dmabufs. That part is
still slow, and provides no limitation on who can allocate how much,
so I think cgroups is still the main potential tool for that. We have
a program that does all the scanning work which is called on-demand
for some use cases, and also manually by users:
https://cs.android.com/android/platform/superproject/main/+/main:system/memory/libmeminfo/libdmabufinfo/tools/dmabuf_dump.cpp

The per-buffer information is used for accounting kernel-only buffers
that don't show up in procfs, and for partially mapped buffers without
fd references where the total buffer size isn't otherwise known. Also
sometimes (manual debugging or bugreports) it's useful just to know
how much memory in total is tied up in dmabufs regardless of who
allocated it because it can be gigabytes due to bugs or crazy program
behaviors; the per buffer info is a faster way to get that then
reading through all of procfs even if you assume everything is
viewable in procfs.

> Additional to that why using DMA-buf for accounting in the first place? See DMA-buf is for sharing buffers and only a minimal fraction of buffers usually need to get shared. Everything else is just massive overhead.

Well we need some way to account all DMA-buf memory because it
consumes a significant portion of total device memory. Even more so
lately where they're used to store >1G AI models for execution on
accelerator hardware. I've attached an example of dmabuf_dump output
to give you an idea of how many buffers we're talking about, and most
of those are (or will be, when an app goes to foreground) shared among
multiple processes and/or drivers.

> > While the kernel must have CONFIG_DEBUG_FS for
> > the dmabuf_iter to be available, debugfs does not need to be mounted.
> > The BPF program loaded by userspace that extracts per-buffer information
> > gets to define its own interface which avoids the lack of ABI stability
> > with debugfs (even if it were mounted).
>
> I think we can make the buffer list independent of CONFIG_DEBUG_FS.

This would be nice. It's a fairly small overhead, and we can make it
less with RCU too. (__dma_buf_debugfs_list_add.png)

> > As this is a replacement for our use of CONFIG_DMABUF_SYSFS_STATS, the
> > last patch is a RFC for removing it from the kernel. Please see my
> > suggestion there regarding the timeline for that.
>
> Oh, yes please!

I thought you might be happy about this. :)

> Regards,
> Christian.
>
> >
> > [1] https://lore.kernel.org/linux-media/20201210044400.1080308-1-hridya@google.com/
> > [2] https://lore.kernel.org/all/20220516171315.2400578-1-tjmercier@google.com/
> >
> > T.J. Mercier (4):
> >   dma-buf: Rename and expose debugfs symbols
> >   bpf: Add dmabuf iterator
> >   selftests/bpf: Add test for dmabuf_iter
> >   RFC: dma-buf: Remove DMA-BUF statistics
> >
> >  .../ABI/testing/sysfs-kernel-dmabuf-buffers   |  24 ---
> >  Documentation/driver-api/dma-buf.rst          |   5 -
> >  drivers/dma-buf/Kconfig                       |  15 --
> >  drivers/dma-buf/Makefile                      |   1 -
> >  drivers/dma-buf/dma-buf-sysfs-stats.c         | 202 ------------------
> >  drivers/dma-buf/dma-buf-sysfs-stats.h         |  35 ---
> >  drivers/dma-buf/dma-buf.c                     |  40 +---
> >  include/linux/btf_ids.h                       |   1 +
> >  include/linux/dma-buf.h                       |   6 +
> >  kernel/bpf/Makefile                           |   3 +
> >  kernel/bpf/dmabuf_iter.c                      | 130 +++++++++++
> >  tools/testing/selftests/bpf/config            |   1 +
> >  .../selftests/bpf/prog_tests/dmabuf_iter.c    | 116 ++++++++++
> >  .../testing/selftests/bpf/progs/dmabuf_iter.c |  31 +++
> >  14 files changed, 299 insertions(+), 311 deletions(-)
> >  delete mode 100644 Documentation/ABI/testing/sysfs-kernel-dmabuf-buffers
> >  delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.c
> >  delete mode 100644 drivers/dma-buf/dma-buf-sysfs-stats.h
> >  create mode 100644 kernel/bpf/dmabuf_iter.c
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/dmabuf_iter.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/dmabuf_iter.c
> >
>

View attachment "dmabuf_dump_example.txt" of type "text/plain" (35764 bytes)

Download attachment "__dma_buf_debugfs_list_add.png" of type "image/png" (152831 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ