linux-kernel - Re: BUG: unable to handle page fault for address

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpEWsN3pxN2EveACvucx7Aip6u_YjJ5u3NxdmU3A3WYOYw@mail.gmail.com>
Date: Sat, 17 May 2025 12:02:37 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: David Wang <00107082@....com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address

On Sat, May 17, 2025 at 10:57 AM David Wang <00107082@....com> wrote:
>
>
>
> At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
> >On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@....com> wrote:
> >>
> >>
> >> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
> >> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@....com> wrote:
> >> >>
> >> >>
> >> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
> >> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
> >> >> >>
> >> >> >> Hi David,
> >> >> >>
> >> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@....com> wrote:
> >> >> >> >
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > I caught a page fault when I was changing my nvidia driver:
> >> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
> >> >> >> >
> >> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >> >> >> > ...
> >> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >> >> >> > ...
> >> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
> >> >> >> > [Fri May 16 12:05:41 2025]  <TASK>
> >> >> >> > [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> >> >> >> > [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
> >> >> >> > ...
> >> >> >> > (full kernel logs are pasted at the end.)
> >> >> >> >
> >> >> >> > Using a image with DEBUG_INFO, the calltrack parses as:
> >> >> >> >
> >> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
> >> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
> >> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
> >> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
> >> >> >> >
> >> >> >> > The offending lines in my codebase:
> >> >> >> >         126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >> >> >> >         127 {
> >> >> >> >         ...
> >> >> >> >         131
> >> >> >> >         132         for_each_possible_cpu(cpu) {
> >> >> >> >         133                 counter = per_cpu_ptr(tag->counters, cpu);
> >> >> >> > >>>>    134                 v.bytes += counter->bytes;   <--------------here
> >> >> >> >         135                 v.calls += counter->calls;
> >> >> >> >
> >> >> >> >
> >> >> >> > Nvidia drivers are out-tree... there could be some strange behavior in it causes this.. but,
> >> >> >> > when I check the code, I got concerned about lifecycle of tag->counters.
> >> >> >> > Based on following defination:
> >> >> >> >         108 #define DEFINE_ALLOC_TAG(_alloc_tag)                                            \
> >> >> >> >         109         static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr);      \
> >> >> >> >         110         static struct alloc_tag _alloc_tag __used __aligned(8)                  \
> >> >> >> >         111         __section(ALLOC_TAG_SECTION_NAME) = {                                   \
> >> >> >> >         112                 .ct = CODE_TAG_INIT,                                            \
> >> >> >> >         113                 .counters = &_alloc_tag_cntr };
> >> >> >> >         114
> >> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different section,
> >> >> >> > and alloc_tag only prepare storage for section ALLOC_TAG_SECTION_NAME.
> >> >> >> > right?
> >> >> >> > Then what happens to those ".data..percpu" section when module is unloaded?
> >> >> >> > Is it safe to keep using those ".data..percpu" section after module unloaded,
> >> >> >> > or even during module is unloading?
> >> >> >>
> >> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
> >> >> >> would free the per-cpu memory allocated for the module. Before
> >> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
> >> >> >> memory") we would not unload the module if there were tags which were
> >> >> >> still in use. After that change we load module tags into separate
> >> >> >> memory, so I expected this to work but due to this external reference
> >> >> >> it indeed should lead to UAF.
> >> >> >> I think the simplest way to fix this would be to bypass
> >> >> >> percpu_modfree() inside free_module() when there are module tags still
> >> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
> >> >> >> it inside clean_unused_module_areas_locked() once we know the counters
> >> >> >> are not used anymore. I'll take a stab at it and will send a patch for
> >> >> >> testing today.
> >> >> >
> >> >> >Ok, I went with another implementation, instead dynamically allocating
> >> >> >percpu memory for modules at the module load time. This has another
> >> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
> >> >> >required for memory allocation tagging to work.
> >> >> >David, the patch is posted at [1]. Please give it a try and let me
> >> >> >know if the fix works for you.
> >> >> >Thanks,
> >> >> >Suren.
> >> >> >
> >> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
> >> >> >
> >> >> >
> >> >> >> Thanks,
> >> >> >> Suren.
> >> >> >>
> >> >>
> >> >> Hi, the patch does fix my issue.
> >> >> I now have another similar concern about modules RO data,
> >> >> The codetag defined as
> >> >>  24 struct codetag {
> >> >>  25         unsigned int flags; /* used in later patches */
> >> >>  26         unsigned int lineno;
> >> >>  27         const char *modname;
> >> >>  28         const char *function;
> >> >>  29         const char *filename;
> >> >>  30 } __aligned(8);
> >> >>
> >> >> Those modname/function/filename would refer to RO data section, right?
> >> >> When module unloaded, its RO data section would be released at some point.
> >> >> My question is is it safe to use RO data during module unload? because these
> >> >> lines seems to access those data:
> >> >>
> >> >> +                       pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
> >> >> +                               tag->ct.filename, tag->ct.lineno, tag->ct.modname,
> >> >> +                               tag->ct.function, counter.bytes);
> >> >
> >> >These lines are called from release_module_tags() using this call chain:
> >> >
> >> >delete_module()
> >> >  free_module()
> >> >    codetag_unload_module()
> >> >      release_module_tags()
> >> >
> >> >and codetag_unload_module() is called at the very beginning of
> >> >free_module(), when no other module memory has been freed yet. So,
> >> >from what I understand, this should be safe.
> >> >After we unload the module these pointers inside the tags will be
> >> >dandling but we should not be using them anymore since we do not
> >> >report unloaded modules. Do you see some usage that I missed?
> >>
> >> Why data..percpu. is different. The page fault error caught when I reinstall nvidia drivers is also
> >> raised from release_module_tags().
> >>
> >> Is data..percpu. section released earlier?
> >
> >No but counters are different because the allocations that still
> >reference these tags from unloaded modules will be decrementing them
> >when they are freed. That's where UAF is coming from. So, the counters
> >might be accessed after the module is unloaded, other fields should
> >not.
> >
>
> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
> happened "during" free_module():
>
>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> ...
>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> ...
>  [Fri May 16 12:05:41 2025] Call Trace:
>  [Fri May 16 12:05:41 2025]  <TASK>
>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>
> The call chain is the same as you mentioned above.

Is this failure happening before or after my fix? With my fix, percpu
data should not be freed at all if tags are still used. Please
clarify.

> This part confusing me a lot. The log indicates during free_module,  ..data..percpu access failed,
> I doubted those section would be released that quick.
>
> The only guess left  I feel reasonable is the ..data_percpu was not paged in at all,  probably  because no access to it,
> and when the section is accessed during free_module, somehow the access is refused. Just guessing.....
>
> Or,  do I missing something here?
>
>
> >>
> >> Thanks
> >> David
> >>
> >>