linux-kernel - Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling format

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a11ddf44-e553-4a9c-a70f-824295317d4f@linux.intel.com>
Date: Tue, 20 Jan 2026 14:46:08 +0800
From: "Mi, Dapeng" <dapeng1.mi@...ux.intel.com>
To: Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
 Dave Hansen <dave.hansen@...ux.intel.com>,
 Adrian Hunter <adrian.hunter@...el.com>, Jiri Olsa <jolsa@...nel.org>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Andi Kleen <ak@...ux.intel.com>, Eranian Stephane <eranian@...gle.com>,
 Mark Rutland <mark.rutland@....com>, broonie@...nel.org,
 Ravi Bangoria <ravi.bangoria@....com>, linux-kernel@...r.kernel.org,
 linux-perf-users@...r.kernel.org, Zide Chen <zide.chen@...el.com>,
 Falcon Thomas <thomas.falcon@...el.com>, Dapeng Mi <dapeng1.mi@...el.com>,
 Xudong Hao <xudong.hao@...el.com>, Kan Liang <kan.liang@...ux.intel.com>
Subject: Re: [Patch v5 18/19] perf parse-regs: Support new SIMD sampling
 format


On 1/20/2026 1:16 PM, Ian Rogers wrote:
> On Mon, Jan 19, 2026 at 7:05 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>>
>> On 1/20/2026 4:25 AM, Ian Rogers wrote:
>>> On Sun, Jan 18, 2026 at 10:55 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>>>> On 1/17/2026 1:50 PM, Ian Rogers wrote:
>>>>> On Mon, Jan 5, 2026 at 11:27 PM Mi, Dapeng <dapeng1.mi@...ux.intel.com> wrote:
>>>>>> Ian,
>>>>>>
>>>>>> I looked at these perf regs __weak helpers again, like
>>>>>> arch__intr_reg_mask()/arch__user_reg_mask(). It could be really hard to
>>>>>> eliminate these __weak helpers and convert them into a generic function
>>>>>> like perf_reg_name(). All these __weak helpers are arch-dependent and
>>>>>> usually need to call perf_event_open sysctrl to get the required registers
>>>>>> mask. So even we convert them into a generic function, we still have no way
>>>>>> to get the registers mask of a different arch, like get x86 registers mask
>>>>>> on arm machine. Another reason is that these __weak helpers may contain
>>>>>> some arch-specific instructions. If we want to convert them into a general
>>>>>> perf function like perf_reg_name(). It may cause building error since these
>>>>>> arch-specific instructions may not exist on the building machine.
>>>>> Hi Dapeng,
>>>>>
>>>>> There was already a patch to better support cross architecture
>>>>> libdw-unwind-ing and I've just sent out a series to clean this up so
>>>>> that this is achieved by having mapping functions between perf and
>>>>> dwarf register names. The functions use the e_machine of the binary to
>>>>> determine how to map, etc. The series is here:
>>>>> https://lore.kernel.org/lkml/20260117052849.2205545-1-irogers@google.com/
>>>>> and I think it can be the foundation for avoiding the weak functions.
>>>> Hi Ian,
>>>>
>>>> Thanks for the reference patch. But they are different. The reference
>>>> patches mainly parse the regs from perf.data and the __weak functions can
>>>> be eliminated in the parsing phase since the registers bitmap is fixed for
>>>> a fixed arch. While these __weak functions
>>>> arch__intr_reg_mask()/arch__user_reg_mask() are used to obtain the support
>>>> sampling registers on a specific platform.
>>>>
>>>> We know different platforms even for same arch may support different
>>>> registers, e.g., some x86 platforms may only support XMM registers, but
>>>> some others may support XMM/YMM/ZMM registers, then all these arch-specific
>>>> arch__intr_reg_mask()/arch__user_reg_mask() functions have to depend on the
>>>> perf_event_open() syscall to retrieve the supported registers mask from kernel.
>>>>
>>>> Thus, it becomes impossible to retrieve the supported registers mask for a
>>>> x86 specific platform from running on a arm platform.
>>>>
>>>> Even we don't consider this limitation and forcibly convert the
>>>> __weak arch__intr_reg_mask() function to some kind of below function, just
>>>> like currently what perf_reg_name() does.
>>>>
>>>> uint64_t perf_intr_reg_mask(const char *arch)
>>>> {
>>>>     uint64_t mask = 0;
>>>>
>>>>     if (!strcmp(arch, "csky"))
>>>>         mask = perf_intr_reg_mask_csky(id);
>>>>     else if (!strcmp(arch, "loongarch"))
>>>>         mask = perf_intr_reg_mask_loongarch(id);
>>>>     else if (!strcmp(arch, "mips"))
>>>>         mask = perf_intr_reg_mask_mips(id);
>>>>     else if (!strcmp(arch, "powerpc"))
>>>>         mask = perf_intr_reg_mask_powerpc(id);
>>>>     else if (!strcmp(arch, "riscv"))
>>>>         mask = perf_intr_reg_mask_riscv(id);
>>>>     else if (!strcmp(arch, "s390"))
>>>>         mask = perf_intr_reg_mask_s390(id);
>>>>     else if (!strcmp(arch, "x86"))
>>>>         mask = perf_intr_reg_mask_x86(id);
>>>>     else if (!strcmp(arch, "arm"))
>>>>         mask = perf_intr_reg_mask_arm(id);
>>>>     else if (!strcmp(arch, "arm64"))
>>>>         mask = perf_intr_reg_mask_arm64(id);
>>>>
>>>>     return mask;
>>>> }
>>>>
>>>> But currently there are some arch-dependent instructions in these
>>>> arch-specific instructions, like the below code in powerpc specific
>>>> arch__intr_reg_mask().
>>>>
>>>>     version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);
>>>>
>>>> mfspr is a powerpc specific instruction, building this converted
>>>> perf_intr_reg_mask on non-powerpc platform would lead to building error.
>>> Hi Dapeng,
>>>
>>> So my main point is the arch directory and ifdefs, how do they differ
>>> from writing code that uses the ELF machine? For example, your code
>>> uses the arch/x86 directory and has ifdefs on
>>> HAVE_ARCH_X86_64_SUPPORT. How is that different from:
>>> ```
>>> switch(e_machine) {
>>> case EM_X86_64:
>>> ...
>>> case EM_I386:
>>> ...
>>> default:
>>> return 0;
>>> }
>>> ```
>>> If we need to determine for the current running machine then e_machine
>>> can equal EM_HOST that is set up for this purpose.
>> I think the key factor that determines if we can convert the code into
>> above e_machine switch ... case format is whether the code is
>> architecture-dependent both in building and execution phases.
>>
>> If the code is not architecture-dependent, It's good to covert the code
>> into the e_machine switch ... case and that would provide better applicability.
>>
>> Otherwise, the architecture-dependent code would lead to the building error
>> (building phase) or get incorrect execution results (execution phase).
>>
>> Even if we introduce EM_HOST case, it won't really solve the building
>> error,  instead it may introduce new building error, e.g.,
>>
>> ```
>> switch(e_machine) {
>> case EM_HOST:
>> ...
>> case EM_X86_64:
>> ...
>> case EM_I386:
>> ...
>> default:
>> return 0;
>> }
>> ```
> No, you wouldn't put EM_HOST as a case statement ever. It is the value
> of the ELF machine you are compiling upon, so either EM_X86_64 or
> EM_I386 here. You would make `e_machine = EM_HOST` were you to want
> some which is specific to the host you are compiling upon. ie `if
> (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) { ... }` would be
> equivalent to `#ifdef __x86_64__` or equivalent to putting code into
> the arch/x86 directory.

The difference between  `if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) {
... }` with  `#ifdef __x86_64__` or equivalent __weak functions is on the
building phase. All the "if (EM_HOST == EM_X86_64 || EM_HOST == EM_I386) {
... }" code would be built on any architectural platforms, i.e., the x86
specific code could need to be build on ARM platform. If there are
arch-specific code, then the building would fail, e.g., there are below
code in powerpc specific arch__intr_reg_mask() function.

    version = (((mfspr(SPRN_PVR)) >>  16) & 0xFFFF);

mfspr is a powerpc specific instruction and it would get building error if building on the other arch platforms.

To avoid the building error, we have to introduce the #ifdef again.


I ever tried to covert the __weak arch__intr_reg_mask() functions into the below function.

```
uint64_t perf_intr_reg_mask(const char *arch)
{
    uint64_t mask = 0;

    if (!strcmp(arch, "csky"))
        mask = perf_intr_reg_mask_csky(arch);
    else if (!strcmp(arch, "loongarch"))
        mask = perf_intr_reg_mask_loongarch(arch);
    else if (!strcmp(arch, "mips"))
        mask = perf_intr_reg_mask_mips(arch);
    else if (!strcmp(arch, "powerpc"))
        mask = perf_intr_reg_mask_powerpc(arch);
    else if (!strcmp(arch, "riscv"))
        mask = perf_intr_reg_mask_riscv(arch);
    else if (!strcmp(arch, "s390"))
        mask = perf_intr_reg_mask_s390(arch);
    else if (!strcmp(arch, "x86"))
        mask = perf_intr_reg_mask_x86(arch);
    else if (!strcmp(arch, "arm"))
        mask = perf_intr_reg_mask_arm(arch);
    else if (!strcmp(arch, "arm64"))
        mask = perf_intr_reg_mask_arm64(arch);

    return mask;
}
```

But this causes the building error for the perf_intr_reg_mask_powerpc() on x86 platform since the powerpc specific mfspr instruction. A way to fix the building error is to introduce the "#ifdef __powerpc__", maybe like this,

#ifdef __powerpc__

uint64_t perf_intr_reg_mask(const char *arch)
{
	...
}

#else

uint64_t perf_intr_reg_mask(const char *arch)
{
	return 0;
}
#endif

Do you think it's a correct way to handle the issue?

>
>> Assume the code is built on a x86_64 machine, then EM_HOST equals
>> EM_X86_64, that would cause the "duplicate case value" building error.
>>
>> If we want to limit the architecture-dependent code is built only on the
>> correct architecture, then we still have to introduce the architecture
>> #ifdefs. This is actually no difference with current arch directory __weak
>> functions and make it more complex.
> If we have arch functions for  arch__user_simd_reg_mask then why would
> code go through the usual means to determine what the mask is? The
> normal means is to get a sample event, use evsel__parse_sample and
> then from the sample access struct regs_dump that has within it keeps
> a copy of the mask from perf_event_attr associated with the evsel.
>
> In your next patch you do:
> https://lore.kernel.org/lkml/20251203065500.2597594-20-dapeng1.mi@linux.intel.com/
> ```
> ...
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> ...
> +static void simd_regs_dump__printf(struct regs_dump *regs, bool intr)
> +{
> ...
> + if (intr)
> + arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
> + else
> + arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
> ...
>  ```
> The session code is generic, it may be dealing with live machine data
> or with a perf.data file from anywhere. The reason this patch is
> exposing these weak functions is for the next patch, there's not a use
> here. But why isn't the next patch using struct regs_dump? The struct
> regs_dump was set up with sample event and perf_event_attr on hand.
> The evsel__parse_sample logic should likely set up a qwords variable
> so the generic code can just do:
> ```
> qwords = regs->qwords;
> ```
> The parsing logic should be able to do "nr_vectors * vector_qwords +
> nr_pred * pred_qwords" from the perf_event_attr no?

The arch__intr_simd_reg_bitmap_qwords() are also used in this patch,  like
the below code

```

static void __print_simd_regs(bool intr, uint64_t simd_mask)
{
    const struct sample_reg *r = NULL;
    uint64_t bitmap = 0;
    u16 qwords = 0;
    int reg_idx;

    if (!simd_mask)
        return;

    for (r = arch__sample_simd_reg_masks(); r->name; r++) {
        if (!(r->mask & simd_mask))
            continue;
        reg_idx = fls64(r->mask) - 1;
        if (intr)
            bitmap = arch__intr_simd_reg_bitmap_qwords(reg_idx, &qwords);
        else
            bitmap = arch__user_simd_reg_bitmap_qwords(reg_idx, &qwords);
        if (bitmap)
            fprintf(stderr, "%s0-%d ", r->name, fls64(bitmap) - 1);
    }
}

```

But I agree this can be convert some kind of generic functions. Let me try
to do it in next version.

>
>>> I agree that determining features needs calls that may not be
>>> supported on other architectures. That should yield EOPNOTSUPP and we
>>> can use information like that to populate generic information like the
>>> PMU missing features:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/pmu.h?h=perf-tools-next#n190
>>> we also probe API support with:
>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/perf_api_probe.h?h=perf-tools-next
>> In general, I agree we can return EOPNOTSUPP or some generic information
>> for some architecture independent code. But it's not applicable for these 2
>> specific arch__intr_reg_mask()/arch__user_reg_mask() functions, current
>> perf code depends on these 2 functions to return the supported registers
>> mask on a specific (running) platform.
> You can't put code into generic code like session and assume the
> perf.data is for the running machine, so the design is wrong.

Ok, let me try to convert them a generic function.


>
>>> The current code doing lots of string comparisons is unnecessary
>>> overhead and imprecise (x86 is used for both 32-bit and 64-bit x86).
>>> It is removed in the series I linked to, I think we can eventually get
>>> rid of the whole arch string for similar reasons of trying to minimize
>>> the use of the arch directory. I'm curious what happens with APX, will
>>> the e_machine change? We may need to pass in the sample regs_dump's
>>> abi field for cases like this.
>> Yes, I agree we should git rid of the arch-string comparison and minimize
>> the use of arch directory. It would improve the efficiency.
>>
>> I don't think the support of APX would change the e_machine, it should
>> still be EM_X86_64.
>>
>> Yes, we need the abi filed (exactly PERF_SAMPLE_REGS_ABI_SIMD) to determine
>> it's APX or legacy XMM.
> Right, in my (unmerged) code to map a perf register to a dwarf register:
> https://lore.kernel.org/lkml/20260117052849.2205545-13-irogers@google.com/
> we'll need the abi field.

Let me try to add the change in the next version.


>
>>> My point on the unwinding is that the sample register mask appears to
>>> be set up the same regardless, whereas for stack samples
>>> (--call-graph=dwarf) maybe just sample IP and SP suffices. So perhaps
>>> there should be additional registers to set up the sample mask.
>> Yes, that's true. It can be further optimized.
>>
>>
>>> By avoiding the arch functions we can avoid the problem of broken
>>> cross architecture support, we can also lay the groundwork for support
>>> on different architectures that may want to do similar things. I agree
>>> that doesn't matter until >1 architecture is trying to have more
>>> register masks, my concern is trying to keep the code generic and
>>> trying to make sure cross architecture is working. New weak functions
>>> is going in the opposite direction to that.
>> Yes, I agree we should git rid of these arch functions as much as possible.
>> But for these architecture dependent code (as above shows), it seems the
>> __weak functions are still the simplest and best way to handle them.
> So I don't think we should be putting functions that assume the
> running machine into generic code like session. The arch functions
> create a shortcut that avoids looking at the perf_event_attr,
> differences between EM_I386 and EM_X86_64, etc. I'm not sure simpler
> matters here, the code is just incorrect relative to how things are
> being done around it. How do I grab registers on an APX capable
> machine and then dump it on my non-APX laptop? How do the arch
> functions account for the differences between EM_I386 and EM_X86_64,
> both of which process types may be running on the machine at the same
> time and having samples show up in system-wide mode? Having the arch
> functions lets things be done wrong and the patch series shows that in
> the very next patch.
>
> Thanks,
> Ian
>
>> Thanks.
>>
>>> Thanks,
>>> Ian
>>>
>>>> -Dapeng Mi
>>>>
>>>>> I also noticed that I think we're sampling the XMM registers for dwarf
>>>>> unwinding, but it seems unlikely the XMM registers will hold stack
>>>>> frame information - so this is probably an x86 inefficiency.
>>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>