Message-ID: <CAP-5=fV+iad1-7O=63QgS=3AztxPxVHc+tui_2iNbBvxuHwuHQ@mail.gmail.com>
Date: Sat, 8 Feb 2025 04:15:49 -0800
From: Ian Rogers <irogers@...gle.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, 
	Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>, 
	Mark Rutland <mark.rutland@....com>, 
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>, 
	Adrian Hunter <adrian.hunter@...el.com>, Kan Liang <kan.liang@...ux.intel.com>, 
	Sam James <sam@...too.org>, Jesper Juhl <jesperjuhl76@...il.com>, 
	James Clark <james.clark@...aro.org>, Zhongqiu Han <quic_zhonhan@...cinc.com>, 
	Yicong Yang <yangyicong@...ilicon.com>, Thomas Richter <tmricht@...ux.ibm.com>, 
	Michael Petlan <mpetlan@...hat.com>, Veronika Molnarova <vmolnaro@...hat.com>, 
	Anne Macedo <retpolanne@...teo.net>, Dominique Martinet <asmadeus@...ewreck.org>, 
	Jean-Philippe Romain <jean-philippe.romain@...s.st.com>, Junhao He <hejunhao3@...wei.com>, 
	linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org, 
	Krzysztof Łopatowski <krzysztof.m.lopatowski@...il.com>
Subject: Re: [PATCH v2 0/7] Add io_dir to avoid memory overhead from opendir

On Sat, Feb 8, 2025 at 2:58 AM David Laight
<david.laight.linux@...il.com> wrote:
>
> On Fri,  7 Feb 2025 15:24:41 -0800
> Ian Rogers <irogers@...gle.com> wrote:
>
> > glibc's opendir allocates a minimum of 32kb, when called recursively
> > for a directory tree the memory consumption can add up - nearly 300kb
> > during perf start-up when processing modules. Add a stack allocated
> > variant of readdir sized a little more than 1kb
>
> Does 300kB really matter?

For someone on a laptop, probably not. For a company running the
command a large number of times in consolidated, heavily loaded
data centers, where binaries run in memory-constrained containers,
memory usage contributes to the latency of the 99th-percentile slow
executions, which, because of the number of machines, is something
seen all the time.

That said, do you need to optimize usages that recurse over the
kernel modules? That's something you wouldn't try to do in a data
center, so there isn't so much need.

> You seem to be trading memory allocation against system calls.
> My 'gut feel' is that the system call cost will dominate.
> (Unless all the directories are small.)

It isn't so much the directories being small as the filenames within
them. getdents64 varies the size of each entry based on the filename
length, which can be up to 256 bytes. As an example, for tracepoints:
```
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:20 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_enter
drwxr-xr-x 1 root root 0 Feb  8 03:23 sys_exit
$ sudo ls -al /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter
total 0
drwxr-xr-x 1 root root 0 Feb  8 03:23 .
drwxr-xr-x 1 root root 0 Feb  8 03:23 ..
-rw-r----- 1 root root 0 Feb  8 03:23 enable
-rw-r----- 1 root root 0 Feb  8 03:23 filter
-r--r----- 1 root root 0 Feb  8 03:23 format
-r--r----- 1 root root 0 Feb  8 03:23 hist
-r--r----- 1 root root 0 Feb  8 03:23 id
-rw-r----- 1 root root 0 Feb  8 03:23 trigger
```
So each d_entry costs you roughly 20 bytes plus the length of the
filename. With 6 entries whose filenames vary up to 9 characters,
you need 24 or 32 bytes per entry after padding. If each entry were
32 bytes, you'd need a buffer of 192 bytes to hold all the data; the
code here uses 1KiB. The value used by glibc is between 32KiB and
1MiB, which is pretty generous when all you need is 192 bytes.
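
For reference, that arithmetic follows from the getdents64(2) record
layout; the struct below is copied from the man page, and the comments
just redo the padding math for the entries listed above:
```
/* Record layout returned by getdents64(2); d_name is variable length
 * and each record is padded so d_reclen is a multiple of 8. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct linux_dirent64 {
	uint64_t d_ino;    /* 8 bytes: inode number */
	int64_t  d_off;    /* 8 bytes: offset to the next entry */
	uint16_t d_reclen; /* 2 bytes: size of this whole record */
	uint8_t  d_type;   /* 1 byte: file type */
	char     d_name[]; /* filename, NUL terminated, then padding */
};

int main(void)
{
	/* 19-byte fixed header: the "roughly 20 bytes plus the filename". */
	printf("header = %zu bytes\n", offsetof(struct linux_dirent64, d_name));
	/* "id" (2 chars + NUL):     19 + 3 = 22 -> padded to 24 bytes.
	 * "enable" (6 chars + NUL): 19 + 7 = 26 -> padded to 32 bytes.
	 * 6 entries at 32 bytes each -> 192 bytes for the directory. */
	return 0;
}
```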

An issue in perf is that we may recursively descend directories and
call opendir at each level, so the buffer size gets multiplied by the
depth of the tree. At hundreds of KiB, these memory allocations are
some of the largest seen in profiles of perf (BPF has some large
allocations too).

So why not fix opendir? I've found trying to get things landed in
glibc is slow to glacial [1], and someone once thought those buffer
sizes were good, so who am I to argue? LLVM's libc and similar
projects are making similarly smaller-than-glibc buffer size choices.
The addition of io_dir means we can tune for the use cases that
matter to perf, such as kernel modules, procfs and sysfs; in general
we don't see directories with lots of entries, and we're a long way
from the 256-byte filename max length. While the code could
potentially increase the number of system calls, in practice it
doesn't.
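
To make the trade concrete, here's a minimal sketch of the approach —
a small caller-owned buffer refilled via getdents64 instead of
opendir's heap allocation. The names are hypothetical; this is not
the io_dir API from the series:
```
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct small_dir {
	int fd;
	int used, avail;                            /* cursor and fill level for buf */
	char buf[1024] __attribute__((aligned(8))); /* ~1KiB, vs glibc's >=32KiB heap */
};

static struct dirent64 *small_dir_read(struct small_dir *d)
{
	if (d->used >= d->avail) {
		long n = syscall(SYS_getdents64, d->fd, d->buf, sizeof(d->buf));

		if (n <= 0)
			return NULL; /* end of directory, or an error */
		d->avail = n;
		d->used = 0;
	}
	/* The kernel's getdents64 records match struct dirent64's layout. */
	struct dirent64 *ent = (struct dirent64 *)(d->buf + d->used);

	d->used += ent->d_reclen;
	return ent;
}

int main(void)
{
	struct small_dir d = { .fd = open(".", O_RDONLY | O_DIRECTORY) };
	struct dirent64 *ent;

	if (d.fd < 0)
		return 1;
	while ((ent = small_dir_read(&d)))
		puts(ent->d_name);
	close(d.fd);
	return 0;
}
```
With small directories like the tracepoint ones above, the whole
directory fits in one getdents64 call, so the syscall count matches
readdir on a 32KiB buffer.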

Thanks,
Ian

[1] I sent patches bringing back the "stack:tid" convention in
/proc/pid/maps by naming stacks on allocation (an extra syscall).
This can be used, for example, to distinguish stack from heap memory
accesses. Linux used to support this, but Linus himself removed the
support: the code had quadratic execution time, since for each stack
allocation it searched for the corresponding thread's stack pointer.
Fixing the opendir memory size and stack naming in glibc would be
good things imo. Were libc better we might not need this series
(though we still know our use cases better than a generic
implementation).
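
For the curious, naming an anonymous mapping on Linux looks roughly
like the sketch below (needs Linux 5.17+ with CONFIG_ANON_VMA_NAME;
the glibc patches aren't reproduced here — this just shows the
underlying prctl, and the "12345" tid is made up):
```
/* Naming an anonymous mapping so it shows up in /proc/self/maps.
 * PR_SET_VMA/PR_SET_VMA_ANON_NAME need recent kernel headers. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

int main(void)
{
	size_t len = 64 * 1024;
	void *stk = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (stk == MAP_FAILED)
		return 1;
	/* One extra syscall per allocation; the region then appears as
	 * "[anon:stack:12345]" in /proc/self/maps. */
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, stk, len, "stack:12345"))
		perror("prctl");

	/* Print the named region to show the effect. */
	char line[256];
	FILE *f = fopen("/proc/self/maps", "r");

	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "stack:12345"))
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}
```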

> There might, of course, be other code in glibc that is designed
> to make it all slower than necessary.
>
>         David
