linux-kernel - Re: [REGRESSION] module BTF validation failure (Error -22) on next

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87wmfzmi5u.fsf@oracle.com>
Date: Mon, 16 Dec 2024 13:28:45 -0800
From: Stephen Brennan <stephen.s.brennan@...cle.com>
To: Alan Maguire <alan.maguire@...cle.com>,
        Cong Wang
 <xiyou.wangcong@...il.com>, Uros Bizjak <ubizjak@...il.com>
Cc: Laura Nao <laura.nao@...labora.com>, bpf@...r.kernel.org,
        chrome-platform@...ts.linux.dev, kernel@...labora.com,
        linux-kernel@...r.kernel.org, regressions@...ts.linux.dev,
        "dwarves@...r.kernel.org" <dwarves@...r.kernel.org>
Subject: Re: [REGRESSION] module BTF validation failure (Error -22) on next

Alan Maguire <alan.maguire@...cle.com> writes:
> On 14/12/2024 12:15, Alan Maguire wrote:
>> On 14/12/2024 04:41, Cong Wang wrote:
>>> On Thu, Dec 05, 2024 at 08:36:33AM +0100, Uros Bizjak wrote:
>>>> On Wed, Dec 4, 2024 at 4:52 PM Laura Nao <laura.nao@...labora.com> wrote:
>>>>>
>>>>> On 11/15/24 18:17, Laura Nao wrote:
>>>>>> I managed to reproduce the issue locally and I've uploaded the vmlinux[1]
>>>>>> (stripped of DWARF data) and vmlinux.raw[2] files, as well as one of the
>>>>>> modules[3] and its btf data[4] extracted with:
>>>>>>
>>>>>> bpftool -B vmlinux btf dump file cros_kbd_led_backlight.ko > cros_kbd_led_backlight.ko.raw
>>>>>>
>>>>>> Looking again at the logs[5], I've noticed the following is reported:
>>>>>>
>>>>>> [    0.415885] BPF:    type_id=115803 offset=177920 size=1152
>>>>>> [    0.416029] BPF:
>>>>>> [    0.416083] BPF: Invalid offset
>>>>>> [    0.416165] BPF:
>>>>>>
>>>>>> There are two different definitions of rcu_data in '.data..percpu', one
>>>>>> is a struct and the other is an integer:
>>>>>>
>>>>>> type_id=115801 offset=177920 size=1152 (VAR 'rcu_data')
>>>>>> type_id=115803 offset=177920 size=1152 (VAR 'rcu_data')
>>>>>>
>>>>>> [115801] VAR 'rcu_data' type_id=115572, linkage=static
>>>>>> [115803] VAR 'rcu_data' type_id=1, linkage=static
>>>>>>
>>>>>> [115572] STRUCT 'rcu_data' size=1152 vlen=69
>>>>>> [1] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
>>>>>>
>>>>>> I assume that's not expected, correct?
>>>>>>
>>>>>> I'll dig a bit deeper and report back if I can find anything else.
>>>>>
>>>>> I ran a bisection, and it appears the culprit commit is:
>>>>> https://lore.kernel.org/all/20241021080856.48746-2-ubizjak@gmail.com/
>>>>>
>>>>> Hi Uros, do you have any suggestions or insights on resolving this issue?
>>>>
>>>> There is a stray ";" at the end of the #define, perhaps this makes a difference:
>>>>
>>>> +#define PERCPU_PTR(__p) \
>>>> + (typeof(*(__p)) __force __kernel *)(__p);
>>>> +
>>>>
>>>> and SHIFT_PERCPU_PTR macro now expands to:
>>>>
>>>> RELOC_HIDE((typeof(*(p)) __force __kernel *)(p);, (offset))
>>>>
>>>> A follow-up patch in the series changes PERCPU_PTR macro to:
>>>>
>>>> #define PERCPU_PTR(__p) \
>>>> ({ \
>>>> unsigned long __pcpu_ptr = (__force unsigned long)(__p); \
>>>> (typeof(*(__p)) __force __kernel *)(__pcpu_ptr); \
>>>> })
>>>>
>>>> so this should again correctly cast the value.
>>>
>>> Hm, I saw a similar bug but with pahole 1.28. My kernel complains about
>>> BTF invalid offset:
>>>
>>> [    7.785788] BPF: 	 type_id=2394 offset=0 size=1
>>> [    7.786411] BPF:
>>> [    7.786703] BPF: Invalid offset
>>> [    7.787119] BPF:
>>>
>>> Dumping the vmlinux (there is no module invovled), I saw it is related to
>>> percpu pointer too:
>>>
>>> [2394] VAR '__pcpu_unique_cpu_hw_events' type_id=2, linkage=global
>>> ...
>>> [163643] DATASEC '.data..percpu' size=2123280 vlen=808
>>>         type_id=2393 offset=0 size=1 (VAR '__pcpu_scope_cpu_hw_events')
>>>         type_id=2394 offset=0 size=1 (VAR '__pcpu_unique_cpu_hw_events')
>>> ...
>>>
>>> I compiled and installed the latest pahole from its git repo:
>>>
>>> $ pahole --version
>>> v1.28
>>>
>>> Thanks.
>> 
>> Thanks for the report! Looking at percpu-defs.h it looks like the
>> existence of such variables requires either
>> 
>> #if defined(ARCH_NEEDS_WEAK_PER_CPU) ||
>> defined(CONFIG_DEBUG_FORCE_WEAK_PER_CPU)
>> 
>> ...
>> 
>> #define DEFINE_PER_CPU_SECTION(type, name, sec)                         \
>>         __PCPU_DUMMY_ATTRS char __pcpu_scope_##name;                    \
>>         extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;            \
>>         __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;                   \
>>         extern __PCPU_ATTRS(sec) __typeof__(type) name;                 \
>>         __PCPU_ATTRS(sec) __weak __typeof__(type) name
>> 
>> 
>> I'm guessing your .config has CONFIG_DEBUG_FORCE_WEAK_PER_CPU, or are
>> you building on s390/alpha?
>> 
>> I've reproduced this on bpf-next with CONFIG_DEBUG_FORCE_WEAK_PER_CPU=y,
>> pahole v1.28 and gcc-12; I see ~900 __pcpu_ variables and get the same
>> BTF errors since multipe __pcpu_ vars share the offset 0.
>> 
>> A simple workaround in dwarves - and I verified this resolved the issue
>> for me - would be
>> 
>> diff --git a/btf_encoder.c b/btf_encoder.c
>> index 3754884..4a1799a 100644
>> --- a/btf_encoder.c
>> +++ b/btf_encoder.c
>> @@ -2174,7 +2174,8 @@ static bool filter_variable_name(const char *name)
>>                 X("__UNIQUE_ID"),
>>                 X("__tpstrtab_"),
>>                 X("__exitcall_"),
>> -               X("__func_stack_frame_non_standard_")
>> +               X("__func_stack_frame_non_standard_"),
>> +               X("__pcpu_")
>>                 #undef X
>>         };
>>         int i;
>> 
>> ...but I'd like us to understand further why variables which were
>> supposed to be in a .discard section end up being encoded as there may
>> be other problems lurking here aside from this one. More soon hopefully...
>>
>
>
> A bit more context here - variable encoding takes the address of the
> variable from DWARF to locate the associated ELF section. Because we
> insist on having a variable specification - with a location - this
> usually works fine. However the problem is that because these dummy
> __pcpu_ variables specify a .discard section, their addresses are 0, so
> we get for example:
>
>  <1><1e535>: Abbrev Number: 114 (DW_TAG_variable)
>     <1e536>   DW_AT_name        : (indirect string, offset: 0x5e97):
> __pcpu_unique_kstack_offset
>     <1e53a>   DW_AT_decl_file   : 1
>     <1e53b>   DW_AT_decl_line   : 823
>     <1e53d>   DW_AT_decl_column : 1
>     <1e53e>   DW_AT_type        : <0x57>
>     <1e542>   DW_AT_external    : 1
>     <1e542>   DW_AT_declaration : 1
>  <1><1e542>: Abbrev Number: 156 (DW_TAG_variable)
>     <1e544>   DW_AT_specification: <0x1e535>
>     <1e548>   DW_AT_location    : 9 byte block: 3 0 0 0 0 0 0 0 0
> (DW_OP_addr: 0)
>
>
> You can see the same thing for a simple program like this:
>
> #include <stdio.h>
>
> #define SEC(name) __attribute__((section(name)))
>
> SEC("/DISCARD/") int d1;
> extern int d1;
> SEC("/DISCARD/") int d2;
> extern int d2;
>
> int main(int argc, char *argv[])
> {
> 	return 0;
> }
>
>
> If you compile it with -g, the DWARF shows that d1 and d2 both have
> address 0:
>
>  <1><72>: Abbrev Number: 5 (DW_TAG_variable)
>     <73>   DW_AT_name        : d1
>     <76>   DW_AT_decl_file   : 1
>     <77>   DW_AT_decl_line   : 5
>     <78>   DW_AT_decl_column : 22
>     <79>   DW_AT_type        : <0x57>
>     <7d>   DW_AT_external    : 1
>     <7d>   DW_AT_location    : 9 byte block: 3 0 0 0 0 0 0 0 0
> (DW_OP_addr: 0)
>  <1><87>: Abbrev Number: 5 (DW_TAG_variable)
>     <88>   DW_AT_name        : d2
>     <8b>   DW_AT_decl_file   : 1
>     <8c>   DW_AT_decl_line   : 7
>     <8d>   DW_AT_decl_column : 22
>     <8e>   DW_AT_type        : <0x57>
>     <92>   DW_AT_external    : 1
>     <92>   DW_AT_location    : 9 byte block: 3 0 0 0 0 0 0 0 0
> (DW_OP_addr: 0)
>
>
> So the reason this happens for dwarves v1.28 in particular is - as I
> understand it - we moved away from recording ELF section information for
> each variable and matching that with DWARF info, instead relying on the
> address to locate the associated ELF section. In cases like the above
> the address information unfortunately leads us astray.
>
> Seems like there's a few approaches we can take in fixing this:
>
> 1. designate "__pcpu_" prefix as a variable prefix to filter out. This
> resolves the immediate problem but is too narrowly focused IMO and we
> may end up playing whack-a-mole with other dummy variable prefixes.
> 2. resurrect ELF section variable information fully; i.e. record a list
> of variables per ELF section (or at least per ELF section we care
> about). If variable is not on the list for the ELF section, do not
> encode it.
> 3. midway between the two; for the 0 address case specifically, verify
> that the variable name really _is_ in the associated ELF section. No
> need to create a local ELF table variable representation, we could just
> walk the table in the case of the 0 addresses.
>
> Diff for approach 3 is as follows
>
> diff --git a/btf_encoder.c b/btf_encoder.c
> index 3754884..21a0ab6 100644
> --- a/btf_encoder.c
> +++ b/btf_encoder.c
> @@ -2189,6 +2189,26 @@ static bool filter_variable_name(const char *name)
>         return false;
>  }
>
> +bool variable_in_sec(struct btf_encoder *encoder, const char *name,
> size_t shndx)
> +{
> +       uint32_t sym_sec_idx;
> +       uint32_t core_id;
> +       GElf_Sym sym;
> +
> +       elf_symtab__for_each_symbol_index(encoder->symtab, core_id, sym,
> sym_sec_idx) {
> +               const char *sym_name;
> +
> +               if (sym_sec_idx != shndx || elf_sym__type(&sym) !=
> STT_OBJECT)
> +                       continue;
> +               sym_name = elf_sym__name(&sym, encoder->symtab);
> +               if (!sym_name)
> +                       continue;
> +               if (strcmp(name, sym_name) == 0)
> +                       return true;
> +       }
> +       return false;
> +}
> +
>  static int btf_encoder__encode_cu_variables(struct btf_encoder *encoder)
>  {
>         struct cu *cu = encoder->cu;
> @@ -2258,6 +2278,11 @@ static int
> btf_encoder__encode_cu_variables(struct btf_encoder *encoder)
>                 if (filter_variable_name(name))
>                         continue;
>
> +               /* A 0 address may be in a .discard section; ensure the
> +                * variable really is in this section by checking ELF
> symtab.
> +                */
> +               if (addr == 0 && !variable_in_sec(encoder, name, shndx))
> +                       continue;
>                 /* Check for invalid BTF names */
>                 if (!btf_name_valid(name)) {
>                         dump_invalid_symbol("Found invalid variable name
> when encoding btf",
>
>
> ...so slightly more complex than option 1, but a bit more general in its
> applicability to .discard section variables.
>
> For the pahole folks, what do we think? Which option (or indeed other
> ones I haven't thought of) makes sense for a fix for this? Thanks!

Hi Alan,

Thanks so much for the analysis. It strikes me that it would be very
nice if the compiler could somehow annotate DIEs that are discarded. But
maybe the discarding happens far enough along that the DWARF can't be
modified?

In any case, while we wish for better compilers, we need a solution now.
I think Option 3 is a really great compromise. Reading it in context
with the code, it makes intuitive sense and it works when we're
outputting just the percpu globals, or when we're outputting all
globals. And there's little risk of issues, given that up until 1.28,
we've been using the ELF name for all variables anyway, so we know that
there have always been ELF symbols for all the variables we care about.

Stephen