Message-ID: <5a1f5612.363e.196df64bd1f.Coremail.00107082@163.com>
Date: Sun, 18 May 2025 01:57:23 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Suren Baghdasaryan" <surenb@...gle.com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address



At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@....com> wrote:
>>
>>
>> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@....com> wrote:
>> >>
>> >>
>> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>> >> >>
>> >> >> Hi David,
>> >> >>
>> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@....com> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > I caught a page fault when I was changing my nvidia driver:
>> >> >> > (This happens randomly; I can reproduce it with about 1/3 probability.)
>> >> >> >
>> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> >> >> > ...
>> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> >> >> > ...
>> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
>> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
>> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>> >> >> > ...
>> >> >> > (full kernel logs are pasted at the end.)
>> >> >> >
>> >> >> > Using an image with DEBUG_INFO, the call trace parses as:
>> >> >> >
>> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >> >> >
>> >> >> > The offending lines in my codebase:
>> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >> >> >         127 {
>> >> >> >         ...
>> >> >> >         131
>> >> >> >         132         for_each_possible_cpu(cpu) {
>> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
>> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
>> >> >> >         135                 v.calls += counter->calls;
>> >> >> >
>> >> >> >
>> >> >> > Nvidia drivers are out-of-tree... some strange behavior in them could be causing this... but
>> >> >> > when I checked the code, I got concerned about the lifecycle of tag->counters.
>> >> >> > Based on the following definition:
>> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
>> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
>> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
>> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
>> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
>> >> >> >         113                 .counters = &_alloc_tag_cntr };
>> >> >> >         114
>> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but the two live in different sections,
>> >> >> > and alloc_tag only prepares storage for the ALLOC_TAG_SECTION_NAME section.
>> >> >> > right?
>> >> >> > Then what happens to the ".data..percpu" section when the module is unloaded?
>> >> >> > Is it safe to keep using that ".data..percpu" section after the module is unloaded,
>> >> >> > or even while the module is unloading?
>> >> >>
>> >> >> Yes, I think you are right: free_module() calls percpu_modfree(), which
>> >> >> frees the per-cpu memory allocated for the module. Before
>> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> >> >> memory") we would not unload the module if there were tags which were
>> >> >> still in use. After that change we load module tags into separate
>> >> >> memory, so I expected this to work, but due to this external reference
>> >> >> it should indeed lead to a UAF.
>> >> >> I think the simplest way to fix this would be to bypass
>> >> >> percpu_modfree() inside free_module() when there are module tags still
>> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
>> >> >> it inside clean_unused_module_areas_locked() once we know the counters
>> >> >> are not used anymore. I'll take a stab at it and will send a patch for
>> >> >> testing today.
>> >> >
>> >> >Ok, I went with another implementation instead, dynamically allocating
>> >> >percpu memory for modules at module load time. This has the additional
>> >> >advantage of not needing the extra PERCPU_MODULE_RESERVE currently
>> >> >required for memory allocation tagging to work.
>> >> >David, the patch is posted at [1]. Please give it a try and let me
>> >> >know if the fix works for you.
>> >> >Thanks,
>> >> >Suren.
>> >> >
>> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>> >> >
>> >> >
>> >> >> Thanks,
>> >> >> Suren.
>> >> >>
>> >>
>> >> Hi, the patch does fix my issue.
>> >> I now have another, similar concern about the module's RO data.
>> >> The codetag is defined as:
>> >>  24 struct codetag {
>> >>  25         unsigned int flags; /* used in later patches */
>> >>  26         unsigned int lineno;
>> >>  27         const char *modname;
>> >>  28         const char *function;
>> >>  29         const char *filename;
>> >>  30 } __aligned(8);
>> >>
>> >> Those modname/function/filename pointers would refer to the RO data section, right?
>> >> When a module is unloaded, its RO data section would be released at some point.
>> >> My question is: is it safe to use the RO data during module unload? Because these
>> >> lines seem to access that data:
>> >>
>> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
>> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
>> >> +                               tag->ct.function, counter.bytes);
>> >
>> >These lines are called from release_module_tags() using this call chain:
>> >
>> >delete_module()
>> >  free_module()
>> >    codetag_unload_module()
>> >      release_module_tags()
>> >
>> >and codetag_unload_module() is called at the very beginning of
>> >free_module(), when no other module memory has been freed yet. So,
>> >from what I understand, this should be safe.
>> >After we unload the module these pointers inside the tags will be
>> >dangling, but we should not be using them anymore since we do not
>> >report unloaded modules. Do you see some usage that I missed?
>>
>> Why is .data..percpu different? The page fault I caught when reinstalling the nvidia drivers is also
>> raised from release_module_tags().
>>
>> Is the .data..percpu section released earlier?
>
>No, but the counters are different because the allocations that still
>reference these tags from unloaded modules will be decrementing them
>when they are freed. That's where the UAF is coming from. So the counters
>might be accessed after the module is unloaded; the other fields should
>not be.
>

I do notice there are places where the counters are referenced "after" free_module(), but the fault in the logs I attached
happened "during" free_module():

[Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
[Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
[Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
...
[Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
...
[Fri May 16 12:05:41 2025] Call Trace:
[Fri May 16 12:05:41 2025]  <TASK>
[Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
[Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0

The call chain is the same as the one you mentioned above.
This part confuses me a lot. The log indicates that the .data..percpu access failed during free_module(),
and I doubt those sections would be released that quickly.
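
If it helps narrow this down, I could instrument the accessor from the excerpt above. A rough, untested
sketch (reconstructed from memory of include/linux/alloc_tag.h, so it may not match your tree line for
line; only the pr_debug line is new) to print the per-CPU address the loop computes, so it can be
compared with the faulting address ffff9d28984c3000:

static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
{
	struct alloc_tag_counters v = { 0, 0 };
	struct alloc_tag_counters *counter;
	int cpu;

	for_each_possible_cpu(cpu) {
		counter = per_cpu_ptr(tag->counters, cpu);
		/* new: log the per-CPU address about to be read */
		pr_debug("alloc_tag: cpu %d counters at %px\n", cpu, counter);
		v.bytes += counter->bytes;
		v.calls += counter->calls;
	}

	return v;
}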

The only guess left that I find reasonable is that the .data..percpu area was never paged in at all, probably because nothing had accessed it,
and when the section is finally accessed during free_module(), the access somehow faults. Just guessing...
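
To test that "never mapped" theory, one thing I might try is checking whether each per-CPU counter
address is actually backed before touching it. A very rough, untested sketch (the helper name is mine,
and is_vmalloc_addr()/vmalloc_to_page() only tell us something for vmalloc-backed percpu chunks, so
this is just a probe, not a fix):

static void alloc_tag_probe_counters(struct alloc_tag *tag)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		void *addr = per_cpu_ptr(tag->counters, cpu);

		/* report per-CPU addresses in the vmalloc range with no backing page */
		if (is_vmalloc_addr(addr) && !vmalloc_to_page(addr))
			pr_warn("alloc_tag: cpu %d counters %px not backed\n",
				cpu, addr);
	}
}

If the faulting address turned out not to be a vmalloc address at all, that would already rule this
guess out.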

Or am I missing something here?


>>
>> Thanks
>> David
>>
>>
