[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fXh82HpvLhPr9uFtq=HK5V+ZL3xntz7-gspEE5az2+7nw@mail.gmail.com>
Date: Tue, 25 Nov 2025 18:55:39 -0800
From: Ian Rogers <irogers@...gle.com>
To: Eric Biggers <ebiggers@...nel.org>
Cc: Namhyung Kim <namhyung@...nel.org>, Arnaldo Carvalho de Melo <acme@...nel.org>,
James Clark <james.clark@...aro.org>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org, Pablo Galindo <pablogsal@...il.com>,
Fangrui Song <maskray@...rceware.org>
Subject: Re: [PATCH v2 1/2] perf jitdump: Add sym/str-tables to build-ID generation
On Tue, Nov 25, 2025 at 11:29 AM Eric Biggers <ebiggers@...nel.org> wrote:
>
> On Tue, Nov 25, 2025 at 12:07:46AM -0800, Namhyung Kim wrote:
> > It was reported that python backtrace with JIT dump was broken after the
> > change to built-in SHA-1 implementation. It seems python generates the
> > same JIT code for each function. They will become separate DSOs but the
> > contents are the same. Only difference is in the symbol name.
> >
> > But this caused a problem that every JIT'ed DSOs will have the same
> > build-ID which makes perf confused. And it resulted in no python
> > symbols (from JIT) in the output.
> >
> > Looking back at the original code before the conversion, it used the
> > load_addr as well as the code section to distinguish each DSO. But it'd
> > be better to use contents of symtab and strtab instead as it aligns with
> > some linker behaviors.
> >
> > This patch adds a buffer to save all the contents in a single place for
> > SHA-1 calculation. Probably we need to add sha1_update() or similar to
> > update the existing hash value with different contents and use it here.
> > But it's out of scope for this change and I'd like something that can be
> > backported to the stable trees easily.
> >
> > Fixes: e3f612c1d8f3945b ("perf genelf: Remove libcrypto dependency and use built-in sha1()")
> > Cc: Eric Biggers <ebiggers@...nel.org>
> > Cc: Pablo Galindo <pablogsal@...il.com>
> > Cc: Fangrui Song <maskray@...rceware.org>
> > Link: https://github.com/python/cpython/issues/139544
> > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
>
> That commit actually preserved the behavior of the existing variant of
> gen_build_id() that was under #ifdef BUILD_ID_SHA. So I guess that code
> was always broken, and it was just never noticed because the alternative
> variant of gen_build_id() under #ifdef BUILD_ID_MD5 was used instead?
>
> The MD5 variant of gen_build_id() just hashed the load_addr concatenated
> with the code. That's not what this patch does, though. So just to
> clarify, you'd actually like to go with a third approach rather than
> just restoring the original hash(load_addr || code) approach?
>
> Also, I missed that you had actually changed the hash algorithm. I had
> assumed the perf folks were were pushing SHA-1 because they were already
> using it. Given that the algorithm changed, there must not be any
> backwards compatibility concerns here, and you should switch to a modern
> hash algorithm such as SHA-256 instead.
>
> I'd be glad to add an incremental API if you need it, but I'm confused
> why you want SHA-1 and not a modern hash algorithm.
Hi Eric,
Thanks for the help with the hash functions! There's a bit more
context in this thread:
https://lore.kernel.org/linux-perf-users/CAP-5=fWLgaWsv82dcPajVk=UmBbmwyEd7OVp6psZQ4TiXh-Meg@mail.gmail.com/
So genelf is trying to take snippets of jitted code and create ELF
files from them for the purpose of symbolizing in perf. The buildid
hash being used is SHA1 and I think the MD5 support was removed as
unnecessary. The problem this patch is addressing is that a JIT may
create many identical stubs which then end up being deduplicated into
the same buildid as only the code is hashed. The BFD linker seems to
have the same issue (Fangrui filed a bug), gold and lld appear to hash
the symbols (which Namhyung adds to genelf here) but still yield
different build id for the same source assembly code. It is possible
to hash the address of the symbol rather than the symbol itself, but I
think the intent for the code should be to best match what a compiler
and linker would generate. The problem there is that this differs for
every linker :-)
Something that is unfortunate in the code now is copying/concatenating
all the build data for the sake of producing the hash. It would be
nice if the code could incrementally build up the sha1 hash to avoid
the copying. I don't know if there is functionality for this
currently.
Thanks,
Ian
> - Eric
Powered by blists - more mailing lists