[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP-5=fXNTSA4wGW+HWS_UdV-iR940A1Td9wgyHOsVbhM0QjE3g@mail.gmail.com>
Date: Mon, 19 Jan 2026 21:16:12 -0800
From: Ian Rogers <irogers@...gle.com>
To: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
Adrian Hunter <adrian.hunter@...el.com>, Jiri Olsa <jolsa@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Andi Kleen <ak@...ux.intel.com>,
Eranian Stephane <eranian@...gle.com>, Mark Rutland <mark.rutland@....com>, broonie@...nel.org,
Ravi Bangoria <ravi.bangoria@....com>, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, Zide Chen <zide.chen@...el.com>,
Falcon Thomas <thomas.falcon@...el.com>, Dapeng Mi <dapeng1.mi@...el.com>,
Xudong Hao <xudong.hao@...el.com>, Kan Liang <kan.liang@...ux.intel.com>
Subject: Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format
On Mon, Jan 19, 2026 at 7:05 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>
>
> On 1/20/2026 4:25 AM, Ian Rogers wrote:
> > On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
> >>
> >> On 1/17/2026 1:50 PM, Ian Rogers wrote:
> >>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
> >>>> Ian,
> >>>>
> >>>> I looked at these perf regs __weak helpers again, like
> >>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
> >>>> eliminate these __weak helpers and convert them into a generic function
> >>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
> >>>> usually need to call perf_event_open sysctrl to get the required registers
> >>>> mask. So even we convert them into a generic function, we still have no way
> >>>> to get the registers mask of a different arch, like get x86 registers mask
> >>>> on arm machine. Another reason is that these __weak helpers may contain
> >>>> some arch-specific instructions. If we want to convert them into a general
> >>>> perf function like perf_reg_name(). It may cause building error since these
> >>>> arch-specific instructions may not exist on the building machine.
> >>> Hi Dapeng,
> >>>
> >>> There was already a patch to better support cross architecture
> >>> libdw-unwind-ing and I've just sent out a series to clean this up so
> >>> that this is achieved by having mapping functions between perf and
> >>> dwarf register names. The functions use the e_machine of the binary to
> >>> determine how to map, etc. The series is here:
> >>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
> >>> and I think it can be the foundation for avoiding the weak functions.
> >> Hi Ian,
> >>
> >> Thanks for the reference patch. But they are different. The reference
> >> patches mainly parse the regs from perf.data and the __weak functions can
> >> be eliminated in the parsing phase since the registers bitmap is fixed for
> >> a fixed arch. While these __weak functions
> >> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the support
> >> sampling registers on a specific platform.
> >>
> >> We know different platforms even for same arch may support different
> >> registers, e.g., some x86 platforms may only support XMM registers, but
> >> some others may support XMM/YMM/ZMM registers, then all these arch-specific
> >> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
> >> perf_event_open() syscall to retrieve the supported registers mask from kernel.
> >>
> >> Thus, it becomes impossible to retrieve the supported registers mask for a
> >> x86 specific platform from running on a arm platform.
> >>
> >> Even we don't consider this limitation and forcibly convert the
> >> __weak arch__intr_reg_mask() function to some kind of below function, just
> >> like currently what perf_reg_name() does.
> >>
> >> uint64_t perf_intr_reg_mask(const char *arch)
> >> {
> >> uint64_t mask = 0;
> >>
> >> if (!strcmp(arch, "csky"))
> >> mask = perf_intr_reg_mask_csky(id);
> >> else if (!strcmp(arch, "loongarch"))
> >> mask = perf_intr_reg_mask_loongarch(id);
> >> else if (!strcmp(arch, "mips"))
> >> mask = perf_intr_reg_mask_mips(id);
> >> else if (!strcmp(arch, "powerpc"))
> >> mask = perf_intr_reg_mask_powerpc(id);
> >> else if (!strcmp(arch, "riscv"))
> >> mask = perf_intr_reg_mask_riscv(id);
> >> else if (!strcmp(arch, "s390"))
> >> mask = perf_intr_reg_mask_s390(id);
> >> else if (!strcmp(arch, "x86"))
> >> mask = perf_intr_reg_mask_x86(id);
> >> else if (!strcmp(arch, "arm"))
> >> mask = perf_intr_reg_mask_arm(id);
> >> else if (!strcmp(arch, "arm64"))
> >> mask = perf_intr_reg_mask_arm64(id);
> >>
> >> return mask;
> >> }
> >>
> >> But currently there are some arch-dependent instructions in these
> >> arch-specific instructions, like the below code in powerpc specific
> >> arch__intr_reg_mask().
> >>
> >> version = (((mfspr(SPRN_PVR)) >> 16) & 0xFFFF);
> >>
> >> mfspr is a powerpc specific instruction, building this converted
> >> perf_intr_reg_mask on non-powerpc platform would lead to building error.
> > Hi Dapeng,
> >
> > So my main point is the arch directory and ifdefs, how do they differ
> > from writing code that uses the ELF machine? For example, your code
> > uses the arch/x86 directory and has ifdefs on
> > HAVE_ARCH_X86_64_SUPPORT. How is that different from:
> > ```
> > switch(e_machine) {
> > case EM_X86_64:
> > ...
> > case EM_I386:
> > ...
> > default:
> > return 0;
> > }
> > ```
> > If we need to determine for the current running machine then e_machine
> > can equal EM_HOST that is set up for this purpose.
>
> I think the key factor that determines if we can convert the code into
> above e_machine switch ... case format is whether the code is
> architecture-dependent both in building and execution phases.
>
> If the code is not architecture-dependent, It's good to covert the code
> into the e_machine switch ... case and that would provide better applicability.
>
> Otherwise, the architecture-dependent code would lead to the building error
> (building phase) or get incorrect execution results (execution phase).
>
> Even if we introduce EM_HOST case, it won't really solve the building
> error, instead it may introduce new building error, e.g.,
>
> ```
> switch(e_machine) {
> case EM_HOST:
> ...
> case EM_X86_64:
> ...
> case EM_I386:
> ...
> default:
> return 0;
> }
> ```
No, you wouldn't put EM_HOST as a case statement ever. It is the value
of the ELF machine you are compiling upon, so either EM_X86_64 or
EM_I386 here. You would make `e_machine = EM_HOST` were you to want
some which is specific to the host you are compiling upon. ie `if
(EM_HOST == EM_X86_64 || EM_HOST == EM_I386) { ... }` would be
equivalent to `#ifdef __x86_64__` or equivalent to putting code into
the arch/x86 directory.
> Assume the code is built on a x86_64 machine, then EM_HOST equals
> EM_X86_64, that would cause the "duplicate case value" building error.
>
> If we want to limit the architecture-dependent code is built only on the
> correct architecture, then we still have to introduce the architecture
> #ifdefs. This is actually no difference with current arch directory __weak
> functions and make it more complex.
If we have arch functions for arch__user_simd_reg_mask then why would
code go through the usual means to determine what the mask is? The
normal means is to get a sample event, use evsel__parse_sample and
then from the sample access struct regs_dump that has within it keeps
a copy of the mask from perf_event_attr associated with the evsel.
In your next patch you do:
https://lore.kernel.org/lkml/20251203065500.2597594-20-dapeng1.mi@linux.intel.com/
```
...
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
...
+static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
+{
...
+ if (intr)
+ arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
+ else
+ arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
...
```
The session code is generic, it may be dealing with live machine data
or with a perf.data file from anywhere. The reason this patch is
exposing these weak functions is for the next patch, there's not a use
here. But why isn't the next patch using struct regs_dump? The struct
regs_dump was set up with sample event and perf_event_attr on hand.
The evsel__parse_sample logic should likely set up a qwords variable
so the generic code can just do:
```
qwords = regs->qwords;
```
The parsing logic should be able to do "nr_vectors * vector_qwords +
nr_pred * pred_qwords" from the perf_event_attr no?
> >
> > I agree that determining features needs calls that may not be
> > supported on other architectures. That should yield EOPNOTSUPP and we
> > can use information like that to populate generic information like the
> > PMU missing features:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
> > we also probe API support with:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next
>
> In general, I agree we can return EOPNOTSUPP or some generic information
> for some architecture independent code. But it's not applicable for these 2
> specific arch__intr_reg_mask()/arch__user_reg_mask() functions, current
> perf code depends on these 2 functions to return the supported registers
> mask on a specific (running) platform.
You can't put code into generic code like session and assume the
perf.data is for the running machine, so the design is wrong.
> >
> > The current code doing lots of string comparisons is unnecessary
> > overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
> > It is removed in the series I linked to, I think we can eventually get
> > rid of the whole arch string for similar reasons of trying to minimize
> > the use of the arch directory. I'm curious what happens with APX, will
> > the e_machine change? We may need to pass in the sample regs_dump's
> > abi field for cases like this.
>
> Yes, I agree we should git rid of the arch-string comparison and minimize
> the use of arch directory. It would improve the efficiency.
>
> I don't think the support of APX would change the e_machine, it should
> still be EM_X86_64.
>
> Yes, we need the abi filed (exactly PERF_SAMPLE_REGS_ABI_SIMD) to determine
> it's APX or legacy XMM.
Right, in my (unmerged) code to map a perf register to a dwarf register:
https://lore.kernel.org/lkml/20260117052849.2205545-13-irogers@google.com/
we'll need the abi field.
> >
> > My point on the unwinding is that the sample register mask appears to
> > be set up the same regardless, whereas for stack samples
> > (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
> > there should be additional registers to set up the sample mask.
>
> Yes, that's true. It can be further optimized.
>
>
> >
> > By avoiding the arch functions we can avoid the problem of broken
> > cross architecture support, we can also lay the groundwork for support
> > on different architectures that may want to do similar things. I agree
> > that doesn't matter until >1 architecture is trying to have more
> > register masks, my concern is trying to keep the code generic and
> > trying to make sure cross architecture is working. New weak functions
> > is going in the opposite direction to that.
>
> Yes, I agree we should git rid of these arch functions as much as possible.
> But for these architecture dependent code (as above shows), it seems the
> __weak functions are still the simplest and best way to handle them.
So I don't think we should be putting functions that assume the
running machine into generic code like session. The arch functions
create a shortcut that avoids looking at the perf_event_attr,
differences between EM_I386 and EM_X86_64, etc. I'm not sure simpler
matters here, the code is just incorrect relative to how things are
being done around it. How do I grab registers on an APX capable
machine and then dump it on my non-APX laptop? How do the arch
functions account for the differences between EM_I386 and EM_X86_64,
both of which process types may be running on the machine at the same
time and having samples show up in system-wide mode? Having the arch
functions lets things be done wrong and the patch series shows that in
the very next patch.
Thanks,
Ian
> Thanks.
>
> >
> > Thanks,
> > Ian
> >
> >> -Dapeng Mi
> >>
> >>> I also noticed that I think we're sampling the XMM registers for dwarf
> >>> unwinding, but it seems unlikely the XMM registers will hold stack
> >>> frame information - so this is probably an x86 inefficiency.
> >>>
> >>> Thanks,
> >>> Ian
> >>>
Powered by blists - more mailing lists