[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20251127221822.GA2977@sol>
Date: Thu, 27 Nov 2025 14:18:22 -0800
From: Eric Biggers <ebiggers@...nel.org>
To: Namhyung Kim <namhyung@...nel.org>
Cc: Ian Rogers <irogers@...gle.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
James Clark <james.clark@...aro.org>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-perf-users@...r.kernel.org,
Pablo Galindo <pablogsal@...il.com>,
Fangrui Song <maskray@...rceware.org>
Subject: Re: [PATCH v2 1/2] perf jitdump: Add sym/str-tables to build-ID
generation
On Thu, Nov 27, 2025 at 01:17:01PM -0800, Namhyung Kim wrote:
> Hello,
>
> On Tue, Nov 25, 2025 at 07:04:52PM -0800, Eric Biggers wrote:
> > On Tue, Nov 25, 2025 at 06:55:39PM -0800, Ian Rogers wrote:
> > > On Tue, Nov 25, 2025 at 11:29 AM Eric Biggers <ebiggers@...nel.org> wrote:
> > > >
> > > > On Tue, Nov 25, 2025 at 12:07:46AM -0800, Namhyung Kim wrote:
> > > > > It was reported that python backtrace with JIT dump was broken after the
> > > > > change to built-in SHA-1 implementation. It seems python generates the
> > > > > same JIT code for each function. They will become separate DSOs but the
> > > > > contents are the same. Only difference is in the symbol name.
> > > > >
> > > > > But this caused a problem that every JIT'ed DSOs will have the same
> > > > > build-ID which makes perf confused. And it resulted in no python
> > > > > symbols (from JIT) in the output.
> > > > >
> > > > > Looking back at the original code before the conversion, it used the
> > > > > load_addr as well as the code section to distinguish each DSO. But it'd
> > > > > be better to use contents of symtab and strtab instead as it aligns with
> > > > > some linker behaviors.
> > > > >
> > > > > This patch adds a buffer to save all the contents in a single place for
> > > > > SHA-1 calculation. Probably we need to add sha1_update() or similar to
> > > > > update the existing hash value with different contents and use it here.
> > > > > But it's out of scope for this change and I'd like something that can be
> > > > > backported to the stable trees easily.
> > > > >
> > > > > Fixes: e3f612c1d8f3945b ("perf genelf: Remove libcrypto dependency and use built-in sha1()")
> > > > > Cc: Eric Biggers <ebiggers@...nel.org>
> > > > > Cc: Pablo Galindo <pablogsal@...il.com>
> > > > > Cc: Fangrui Song <maskray@...rceware.org>
> > > > > Link: https://github.com/python/cpython/issues/139544
> > > > > Signed-off-by: Namhyung Kim <namhyung@...nel.org>
> > > >
> > > > That commit actually preserved the behavior of the existing variant of
> > > > gen_build_id() that was under #ifdef BUILD_ID_SHA. So I guess that code
> > > > was always broken, and it was just never noticed because the alternative
> > > > variant of gen_build_id() under #ifdef BUILD_ID_MD5 was used instead?
> > > >
> > > > The MD5 variant of gen_build_id() just hashed the load_addr concatenated
> > > > with the code. That's not what this patch does, though. So just to
> > > > clarify, you'd actually like to go with a third approach rather than
> > > > just restoring the original hash(load_addr || code) approach?
> > > >
> > > > Also, I missed that you had actually changed the hash algorithm. I had
> > > > assumed the perf folks were were pushing SHA-1 because they were already
> > > > using it. Given that the algorithm changed, there must not be any
> > > > backwards compatibility concerns here, and you should switch to a modern
> > > > hash algorithm such as SHA-256 instead.
> > > >
> > > > I'd be glad to add an incremental API if you need it, but I'm confused
> > > > why you want SHA-1 and not a modern hash algorithm.
> > >
> > > Hi Eric,
> > >
> > > Thanks for the help with the hash functions! There's a bit more
> > > context in this thread:
> > > https://lore.kernel.org/linux-perf-users/CAP-5=fWLgaWsv82dcPajVk=UmBbmwyEd7OVp6psZQ4TiXh-Meg@mail.gmail.com/
> > >
> > > So genelf is trying to take snippets of jitted code and create ELF
> > > files from them for the purpose of symbolizing in perf. The buildid
> > > hash being used is SHA1 and I think the MD5 support was removed as
> > > unnecessary. The problem this patch is addressing is that a JIT may
> > > create many identical stubs which then end up being deduplicated into
> > > the same buildid as only the code is hashed. The BFD linker seems to
> > > have the same issue (Fangrui filed a bug), gold and lld appear to hash
> > > the symbols (which Namhyung adds to genelf here) but still yield
> > > different build id for the same source assembly code. It is possible
> > > to hash the address of the symbol rather than the symbol itself, but I
> > > think the intent for the code should be to best match what a compiler
> > > and linker would generate. The problem there is that this differs for
> > > every linker :-)
> > >
> > > Something that is unfortunate in the code now is copying/concatenating
> > > all the build data for the sake of producing the hash. It would be
> > > nice if the code could incrementally build up the sha1 hash to avoid
> > > the copying. I don't know if there is functionality for this
> > > currently.
> >
> > Again, I can add support for incremental hashing if you need it. But I
> > don't understand why you want the hash function to be SHA-1. Also,
> > given that this seems to be a regression fix, it's surprising that
> > you're suddenly changing the inputs to the hash entirely, instead of
> > just going back to hashing load_addr concatenated with the code for now.
>
> Historically MD5 or SHA1 was the only choice for build-ID in linkers.
> I don't know if it's changed lately though. But because of that we
> support build-IDs up to 20 bytes and it's in the UAPI too (well.. it's
> implicit in PERF_RECORD_MMAP2). I think we can support other hashs if
> they fit into it, but SHA-1 would be the right choice for now.
>
> About load_addr, I don't think it's the right fix. We cannot have the
> exactly same build-IDs as before since we already switched to SHA-1.
> What we actually want is a way to generate unique IDs for each DSO.
> For that purpose, it'd be nice to use sym/str-tables as Fangrui said
> instead of using load_addr which seems to be a quick hack in the past.
>
> So I think I can merge this change for stable fix. And add incremental
> SHA-1 (thanks in advance!) and update the code to it later.
If it needs to be 20 bytes, you should just use a truncated SHA-256.
The only use for SHA-1 is backwards compatibility. Which it seems you
don't need.
- Eric
Powered by blists - more mailing lists