Message-ID: <551cd408.515.196e11108a5.Coremail.00107082@163.com>
Date: Sun, 18 May 2025 09:45:12 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Suren Baghdasaryan" <surenb@...gle.com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address
At 2025-05-18 03:02:37, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>On Sat, May 17, 2025 at 10:57 AM David Wang <00107082@....com> wrote:
>>
>>
>>
>> At 2025-05-18 01:29:35, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>> >On Sat, May 17, 2025 at 9:51 AM David Wang <00107082@....com> wrote:
>> >>
>> >>
>> >> At 2025-05-18 00:39:30, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>> >> >On Sat, May 17, 2025 at 12:02 AM David Wang <00107082@....com> wrote:
>> >> >>
>> >> >>
>> >> >> At 2025-05-17 08:11:24, "Suren Baghdasaryan" <surenb@...gle.com> wrote:
>> >> >> >On Fri, May 16, 2025 at 10:03 AM Suren Baghdasaryan <surenb@...gle.com> wrote:
>> >> >> >>
>> >> >> >> Hi David,
>> >> >> >>
>> >> >> >> On Fri, May 16, 2025 at 6:13 AM David Wang <00107082@....com> wrote:
>> >> >> >> >
>> >> >> >> > Hi,
>> >> >> >> >
>> >> >> >> > I caught a page fault when I was changing my nvidia driver:
>> >> >> >> > (This happens randomly, I can reproduce it with about 1/3 probability)
>> >> >> >> >
>> >> >> >> > [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> >> >> >> > [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> >> >> >> > [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> >> >> >> > ...
>> >> >> >> > [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> >> >> >> > ...
>> >> >> >> > [Fri May 16 12:05:41 2025] Call Trace:
>> >> >> >> > [Fri May 16 12:05:41 2025] <TASK>
>> >> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module+0x135/0x160
>> >> >> >> > [Fri May 16 12:05:41 2025] free_module+0x19/0x1a0
>> >> >> >> > ...
>> >> >> >> > (full kernel logs are pasted at the end.)
>> >> >> >> >
>> >> >> >> > Using an image with DEBUG_INFO, the call trace parses as:
>> >> >> >> >
>> >> >> >> > RIP: 0010:release_module_tags (./include/linux/alloc_tag.h:134 lib/alloc_tag.c:352 lib/alloc_tag.c:573)
>> >> >> >> > [Fri May 16 12:05:41 2025] codetag_unload_module (lib/codetag.c:355)
>> >> >> >> > [Fri May 16 12:05:41 2025] free_module (kernel/module/main.c:1305)
>> >> >> >> > [Fri May 16 12:05:41 2025] __do_sys_delete_module (kernel/module/main.c:795)
>> >> >> >> >
>> >> >> >> > The offending lines in my codebase:
>> >> >> >> > 126 static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
>> >> >> >> > 127 {
>> >> >> >> > ...
>> >> >> >> > 131
>> >> >> >> > 132 for_each_possible_cpu(cpu) {
>> >> >> >> > 133 counter = per_cpu_ptr(tag->counters, cpu);
>> >> >> >> > >>>> 134 v.bytes += counter->bytes; <--------------here
>> >> >> >> > 135 v.calls += counter->calls;
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > Nvidia drivers are out-of-tree... there could be some strange behavior in them that causes this... but
>> >> >> >> > when I checked the code, I became concerned about the lifecycle of tag->counters.
>> >> >> >> > Based on the following definition:
>> >> >> >> > 108 #define DEFINE_ALLOC_TAG(_alloc_tag) \
>> >> >> >> > 109 static DEFINE_PER_CPU(struct alloc_tag_counters, _alloc_tag_cntr); \
>> >> >> >> > 110 static struct alloc_tag _alloc_tag __used __aligned(8) \
>> >> >> >> > 111 __section(ALLOC_TAG_SECTION_NAME) = { \
>> >> >> >> > 112 .ct = CODE_TAG_INIT, \
>> >> >> >> > 113 .counters = &_alloc_tag_cntr };
>> >> >> >> > 114
>> >> >> >> > _alloc_tag_cntr is the data referenced by tag->counters, but they are in different sections,
>> >> >> >> > and alloc_tag only prepares storage for the ALLOC_TAG_SECTION_NAME section,
>> >> >> >> > right?
>> >> >> >> > Then what happens to the ".data..percpu" section when the module is unloaded?
>> >> >> >> > Is it safe to keep using that ".data..percpu" section after the module is unloaded,
>> >> >> >> > or even while the module is unloading?
>> >> >> >>
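To make the lifetime concern concrete, here is a rough userspace analogue of the read path (this is not kernel code: NR_CPUS, per_cpu_ptr_sim() and the flat counter array are stand-ins I made up; the real per_cpu_ptr() resolves a per-CPU offset into the ".data..percpu" area):

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

struct alloc_tag_counters { unsigned long long bytes, calls; };

struct alloc_tag {
	/* In the kernel this lives in the ALLOC_TAG_SECTION_NAME section;
	 * counters points into the module's ".data..percpu" area. */
	struct alloc_tag_counters *counters;
};

/* Userspace stand-in for per_cpu_ptr(): index into a flat array. */
static struct alloc_tag_counters *
per_cpu_ptr_sim(struct alloc_tag_counters *base, int cpu)
{
	return &base[cpu];
}

/* Mirrors alloc_tag_read(): sums the per-CPU counters through
 * tag->counters.  If the percpu storage was freed separately from the
 * tag itself, this walk dereferences freed memory. */
static struct alloc_tag_counters alloc_tag_read_sim(struct alloc_tag *tag)
{
	struct alloc_tag_counters v = { 0, 0 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct alloc_tag_counters *c = per_cpu_ptr_sim(tag->counters, cpu);

		v.bytes += c->bytes;	/* the faulting line in the report */
		v.calls += c->calls;
	}
	return v;
}
```

Freeing the counter array while the tag struct is still reachable models percpu_modfree() running before the tags are done: any later alloc_tag_read_sim() is a use-after-free.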
>> >> >> >> Yes, I think you are right, free_module() calls percpu_modfree() which
>> >> >> >> would free the per-cpu memory allocated for the module. Before
>> >> >> >> 0db6f8d7820a ("alloc_tag: load module tags into separate contiguous
>> >> >> >> memory") we would not unload the module if there were tags which were
>> >> >> >> still in use. After that change we load module tags into separate
>> >> >> >> memory, so I expected this to work but due to this external reference
>> >> >> >> it indeed should lead to UAF.
>> >> >> >> I think the simplest way to fix this would be to bypass
>> >> >> >> percpu_modfree() inside free_module() when there are module tags still
>> >> >> >> referenced, store mod->percpu inside alloc_tag_module_section and free
>> >> >> >> it inside clean_unused_module_areas_locked() once we know the counters
>> >> >> >> are not used anymore. I'll take a stab at it and will send a patch for
>> >> >> >> testing today.
>> >> >> >
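A rough userspace sketch of that deferral idea (every name here, the live_refs count and the fixed-size pending table are invented for illustration; the real change would hook free_module() and the codetag cleanup, not look like this):

```c
#include <assert.h>
#include <stdlib.h>

/* Percpu areas whose counters are still referenced by live allocations. */
struct unused_area {
	void *percpu;
	long live_refs;
};

static struct unused_area pending[8];
static int npending;

/* Module-unload path: free the percpu area immediately only if no
 * allocation still references its counters; otherwise stash it. */
static void module_percpu_free(void *percpu, long live_refs)
{
	if (live_refs == 0) {
		free(percpu);
		return;
	}
	pending[npending].percpu = percpu;
	pending[npending].live_refs = live_refs;
	npending++;
}

/* Periodic cleanup (the clean_unused_module_areas_locked() idea):
 * release areas whose last referencing allocation has been freed.
 * Returns how many areas were released. */
static int clean_unused_areas(void)
{
	int freed = 0;

	for (int i = 0; i < npending; ) {
		if (pending[i].live_refs == 0) {
			free(pending[i].percpu);
			pending[i] = pending[--npending];
			freed++;
		} else {
			i++;
		}
	}
	return freed;
}
```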
>> >> >> >Ok, I went with another implementation, instead dynamically allocating
>> >> >> >percpu memory for modules at the module load time. This has another
>> >> >> >advantage of not needing extra PERCPU_MODULE_RESERVE currently
>> >> >> >required for memory allocation tagging to work.
>> >> >> >David, the patch is posted at [1]. Please give it a try and let me
>> >> >> >know if the fix works for you.
>> >> >> >Thanks,
>> >> >> >Suren.
>> >> >> >
>> >> >> >[1] https://lore.kernel.org/all/20250517000739.5930-1-surenb@google.com/
>> >> >> >
>> >> >> >
>> >> >> >> Thanks,
>> >> >> >> Suren.
>> >> >> >>
>> >> >>
>> >> >> Hi, the patch does fix my issue.
>> >> >> I now have another similar concern, about module RO data.
>> >> >> codetag is defined as:
>> >> >> 24 struct codetag {
>> >> >> 25 unsigned int flags; /* used in later patches */
>> >> >> 26 unsigned int lineno;
>> >> >> 27 const char *modname;
>> >> >> 28 const char *function;
>> >> >> 29 const char *filename;
>> >> >> 30 } __aligned(8);
>> >> >>
>> >> >> Those modname/function/filename pointers would refer to the RO data section, right?
>> >> >> When a module is unloaded, its RO data section is released at some point.
>> >> >> My question is: is it safe to use RO data during module unload? These
>> >> >> lines seem to access that data:
>> >> >>
>> >> >> + pr_info("%s:%u module %s func:%s has %llu allocated at module unload\n",
>> >> >> + tag->ct.filename, tag->ct.lineno, tag->ct.modname,
>> >> >> + tag->ct.function, counter.bytes);
>> >> >
>> >> >These lines are called from release_module_tags() using this call chain:
>> >> >
>> >> >delete_module()
>> >> > free_module()
>> >> > codetag_unload_module()
>> >> > release_module_tags()
>> >> >
>> >> >and codetag_unload_module() is called at the very beginning of
>> >> >free_module(), when no other module memory has been freed yet. So,
>> >> >from what I understand, this should be safe.
>> >> >After we unload the module these pointers inside the tags will be
>> >> >dangling but we should not be using them anymore since we do not
>> >> >report unloaded modules. Do you see some usage that I missed?
>> >>
>> >> Why is .data..percpu different? The page fault caught when I reinstalled the nvidia drivers is also
>> >> raised from release_module_tags().
>> >>
>> >> Is the .data..percpu section released earlier?
>> >
>> >No but counters are different because the allocations that still
>> >reference these tags from unloaded modules will be decrementing them
>> >when they are freed. That's where UAF is coming from. So, the counters
>> >might be accessed after the module is unloaded, other fields should
>> >not.
>> >
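Roughly, the accounting that keeps the counters live can be pictured with this userspace analogue (a single shared counter instead of per-CPU ones, and all names invented; in the kernel a codetag_ref stored alongside each allocation plays the role of tracked_obj.tag):

```c
#include <assert.h>
#include <stddef.h>

struct alloc_tag_counters { long long bytes, calls; };
struct alloc_tag { struct alloc_tag_counters *counters; };

/* Each allocation remembers which tag accounted it. */
struct tracked_obj {
	struct alloc_tag *tag;
	size_t size;
};

static void account_alloc(struct tracked_obj *obj,
			  struct alloc_tag *tag, size_t size)
{
	obj->tag = tag;
	obj->size = size;
	tag->counters->bytes += size;
	tag->counters->calls += 1;
}

/* Runs from the free path of ANY accounted object, whenever that object
 * is finally freed -- possibly long after the module owning `tag` was
 * unloaded.  This is why tag->counters must stay valid past
 * free_module(), while modname/function/filename are never touched on
 * this path. */
static void account_free(struct tracked_obj *obj)
{
	obj->tag->counters->bytes -= obj->size;
}
```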
>>
>> I do notice there are places where counters are referenced "after" free_module(), but the failure in
>> the logs I attached happened "during" free_module():
>>
>> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>> ...
>> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>> ...
>> [Fri May 16 12:05:41 2025] Call Trace:
>> [Fri May 16 12:05:41 2025] <TASK>
>> [Fri May 16 12:05:41 2025] codetag_unload_module+0x135/0x160
>> [Fri May 16 12:05:41 2025] free_module+0x19/0x1a0
>>
>> The call chain is the same as you mentioned above.
>
>Is this failure happening before or after my fix? With my fix, percpu
>data should not be freed at all if tags are still used. Please
>clarify.
It is before your fix. Your patch does fix the issue.
My reproduction procedure:
1. enter recovery mode
2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
3. modprobe drm_client_lib
4. install nvidia driver 570.144
5. install nvidia driver 550.144.03
6. reboot and repeat from step 1
The error happened in step 4, and the failure in step 2 is crucial: if I 'modprobe drm_client_lib' at the very beginning, no error can be observed.
There may be something off about how the kernel handles the .data..percpu section.
The good thing is that it is reproducible; I can add debug messages to confirm or rule out suspicions.
Any suggestions?
Thanks
David
>
>> This part confuses me a lot. The log indicates that during free_module() an access to .data..percpu failed;
>> I doubt those sections would be released that quickly.
>>
>> The only guess left that I feel is reasonable is that .data..percpu was never paged in at all, probably because
>> nothing accessed it, and when the section is accessed during free_module() the access is somehow refused. Just guessing...
>>
>> Or am I missing something here?
>>
>>
>> >>
>> >> Thanks
>> >> David
>> >>
>> >>