[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEf4BzaQYqPfe2Qb5n71JVAAD3-1Q7q2+_cnQMQEa43DvV5PCQ@mail.gmail.com>
Date: Fri, 1 Nov 2024 11:34:48 -0700
From: Andrii Nakryiko <andrii.nakryiko@...il.com>
To: Josh Poimboeuf <jpoimboe@...nel.org>
Cc: x86@...nel.org, Peter Zijlstra <peterz@...radead.org>,
Steven Rostedt <rostedt@...dmis.org>, Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...nel.org>, linux-kernel@...r.kernel.org,
Indu Bhagat <indu.bhagat@...cle.com>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>, linux-perf-users@...r.kernel.org,
Mark Brown <broonie@...nel.org>, linux-toolchains@...r.kernel.org,
Jordan Rome <jordalgo@...a.com>, Sam James <sam@...too.org>, linux-trace-kernel@...r.kerne.org,
Jens Remus <jremus@...ux.ibm.com>, Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Florian Weimer <fweimer@...hat.com>, Andy Lutomirski <luto@...nel.org>, bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH v3 09/19] unwind: Introduce sframe user space unwinding
+cc bpf@ where people care about stack traces and profiling as well
(please cc bpf@...r.kernel.org for next revisions as well, I'm sure a
bunch of folks would appreciate it and have something useful to say)
On Thu, Oct 31, 2024 at 4:03 PM Josh Poimboeuf <jpoimboe@...nel.org> wrote:
>
> On Thu, Oct 31, 2024 at 01:57:10PM -0700, Andrii Nakryiko wrote:
> > > > what if ip_off is larger than 4GB? ELF section can be bigger than 4GB, right?
> > >
> > > That's baked into sframe v2.
> >
> > I believe we do have large production binaries with more than 4GB of
> > text, what are we going to do about them? It would be interesting to
> > hear sframe people's opinion. Adding such a far-reaching new format in
> > 2024 with these limitations is kind of sad. At the very least maybe we
> > should allow some form of chaining sframe definitions to cover more
> > than 4GB segments? Please CC relevant folks, I'm wondering what
> > they're thinking about this.
>
> Personally I find the idea of a single 4GB+ text segment pretty
> surprising as I've never seen anything even close to that.
I grabbed one of Meta production servers running one of the big-ish (I
don't know if it's even the largest, most probably not) service. Here
are first few sections from /proc/pid/maps belonging to the main
binary:
00200000-170ad000 r--p 00000000 07:01 5
172ac000-498e7000 r-xp 16eac000 07:01 5
49ae7000-49b8b000 r--p 494e7000 07:01 5
49d8b000-4a228000 rw-p 4958b000 07:01 5
4a228000-4c677000 rw-p 00000000 00:00 0
4c800000-4ca00000 r-xp 49c00000 07:01 5
4ca00000-4f600000 r-xp 49e00000 07:01 5
4f600000-5b270000 r-xp 4ca00000 07:01 5
Few observations:
1) There are 4 executable segments in just the first 8 entries.
2) Their total size is already approaching 1.5GB:
>>> ((0x170ad000 - 0x200000) + (0x5b270000 - 0x4f600000) + (0x498e7000 - 0x172ac000)) / 1024 / 1024
1361.34375
I don't know about you, but from my experience things like code size
tend to just grow over time, rarely it shrinks (and even that usually
requires tremendous and focused efforts).
>
> Anyway it's iterative development and not everybody's requirements are
> clear from day 1. Which is why we're discussing it now. I think there
> are already plans to do an sframe v3.
Of course, which is why I'm providing this feedback. But it would be
nice to avoid having to support a zoo of versions if we already know
there are practical limitations that we are not that far from hitting.
>
> > > > > + if (text_vma) {
> > > > > + pr_warn_once("%s[%d]: multiple EXEC segments unsupported\n",
> > > > > + current->comm, current->pid);
> > > >
> > > > is this just something that fundamentally can't be supported by SFrame
> > > > format? Or just an implementation simplification?
> > >
> > > It's a simplification I suppose.
> >
> > That's a rather random limitation, IMO... How hard would it be to not
> > make that assumption?
>
> It's definitely not random, there's no need to complicate the code if
> this condition doesn't exist.
Sorry, I'm probably dense and missing something. But from the example
process above, isn't this check violated already? Or it's two
different things? Not sure, honestly.
>
> > > > It's not illegal to have an executable with multiple VM_EXEC segments,
> > > > no? Should this be a pr_warn_once() then?
> > >
> > > I don't know, is it allowed? I've never seen it in practice. The
> >
> > I'm pretty sure you can do that with a custom linker script, at the
> > very least. Normally this probably won't happen, but I don't think
> > Linux dictates how many executable VMAs an application can have.
> > And it probably just naturally happens for JIT-ted applications (Java,
> > Go, etc).
>
> Actually I just double checked and even the kernel's ELF loader assumes
> that each executable has only a single text start+end address pair.
See above, very confused by such assumptions, but I'm hoping we are
talking about two different things here.
>
> > > pr_warn_once() is not reporting that it's illegal but rather that this
> > > corner case actually exists and maybe needs to be looked at.
> >
> > This warn() will be logged across millions of machines in the fleet,
> > triggering alarms, people looking at this, making custom internal
> > patches to disable the known-to-happen warn. Why do we need all this?
> > This is an issue that is trivial to trigger by user process that's not
> > doing anything illegal. Why?
>
> There's no point in adding complexity to support some hypothetical. I
> can remove the printk though.
We are talking about fundamental things like format for supporting
frame pointer-less stack trace capture. It will take years to adopt
SFrame everywhere, so I think it's prudent to think a bit ahead beyond
just saying "no real application should need more than 4GB text", IMO.
>
> --
> Josh
Powered by blists - more mailing lists