Message-ID: <CAJuCfpHprFd2i92QfM+bDQE06eS79Q=CKQJ7GH-Vs3eBBi-yVg@mail.gmail.com>
Date: Mon, 19 May 2025 09:00:46 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: David Wang <00107082@....com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address
On Sun, May 18, 2025 at 2:55 AM David Wang <00107082@....com> wrote:
>
>
> >>>
> >>> I do notice there are places where counters are referenced "after" free_module(), but the crashes
> >>> in the logs I attached happened "during" free_module():
> >>>
> >>> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >>> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >>> ...
> >>> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >>> ...
> >>> [Fri May 16 12:05:41 2025] Call Trace:
> >>> [Fri May 16 12:05:41 2025] <TASK>
> >>> [Fri May 16 12:05:41 2025] codetag_unload_module+0x135/0x160
> >>> [Fri May 16 12:05:41 2025] free_module+0x19/0x1a0
> >>>
> >>> The call chain is the same as you mentioned above.
> >>
> >>Is this failure happening before or after my fix? With my fix, percpu
> >>data should not be freed at all if tags are still used. Please
> >>clarify.
> >
> >It is before your fix. Your patch does fix the issue.
> >
> >My reproduction procedure is:
> >1. enter recovery mode
> >2. install nvidia driver 570.144; it fails with "Unknown symbol drm_client_setup"
> >3. modprobe drm_client_lib
> >4. install nvidia driver 570.144
> >5. install nvidia driver 550.144.03
> >6. reboot and repeat from step 1
> >
> >The error happened in step 4, and the failure in step 2 is crucial: if I modprobe drm_client_lib at the very beginning, no error can be observed.
> >
> >There may be something off about how the kernel handles the module's .data..percpu section.
> >The good thing is that it can be reproduced, so I can add debug messages to confirm or rule out suspicions.
> >Any suggestions?
> >
> >
> >Thanks
> >David
> >
> >
> After poking around and logging memory addresses, I think I finally understand what is happening here.
>
> 1. codetag_alloc_module_section() allocates memory while a module is being loaded
> 2. the module load fails due to an undefined symbol
> 3. the codetag section memory is not freed
> 4. a module is loaded again, and its address happens to reuse the previously used address
> 5. another codetag_alloc_module_section() call is made for it
> 6. the percpu section is allocated, and relocation then writes the per-CPU addresses into the memory from step 5
> 7. on module unload, the search through the maple tree finds the codetag memory from step 1,
> which has no relocated addresses populated at all
> 8. page fault, because tag->counters is 0 (a sketch of this pattern follows below)
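>
> To make the pattern concrete, here is a minimal userspace sketch of the
> sequence above. All names (fake_tag, area_entry, alloc_area, unload) are
> made up for illustration; it stands in for the maple-tree bookkeeping and
> skips every other detail: a failed load leaves an entry behind whose
> counters were never relocated, and a later unload that looks entries up by
> a reused module address can find that stale entry first.
>
> #include <stdio.h>
> #include <stdlib.h>
>
> struct fake_tag {
>     long *counters;             /* stays NULL if relocation never ran */
> };
>
> struct area_entry {
>     void *mod;                  /* "module" address, the lookup key */
>     struct fake_tag tag;        /* codetag memory reserved for that module */
> };
>
> #define MAX_ENTRIES 8
> static struct area_entry registry[MAX_ENTRIES];
> static int nr_entries;
>
> /* steps 1 and 5: reserve codetag memory for a loading module */
> static struct fake_tag *alloc_area(void *mod)
> {
>     registry[nr_entries].mod = mod;
>     return &registry[nr_entries++].tag;
> }
>
> /* steps 7 and 8: unload looks the module up and reads its counters */
> static void unload(void *mod)
> {
>     for (int i = 0; i < nr_entries; i++) {
>         if (registry[i].mod != mod)
>             continue;
>         if (!registry[i].tag.counters)
>             printf("entry %d is stale, counters == NULL -> the kernel would fault here\n", i);
>         else
>             printf("entry %d: counter %ld\n", i, *registry[i].tag.counters);
>         return;                 /* first match wins, like the stale range in step 7 */
>     }
> }
>
> int main(void)
> {
>     void *mod1 = malloc(64);
>     alloc_area(mod1);                           /* step 1: memory reserved...         */
>     free(mod1);                                 /* steps 2-3: load fails, entry leaks */
>
>     void *mod2 = malloc(64);                    /* step 4: may reuse mod1's address   */
>     struct fake_tag *t = alloc_area(mod2);      /* step 5 */
>     t->counters = calloc(1, sizeof(long));      /* step 6: "relocation" fills it in   */
>
>     unload(mod2);       /* if mod2 == mod1, the stale entry from step 1 is found first */
>     free(t->counters);
>     free(mod2);
>     return 0;
> }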
>
> To confirm this, I used the following change to log the addresses (the offending one shows up in the output further below):
>
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
>         if (!used)
>                 goto release_area;
>
> +       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
> +       pr_info("percpu 0: 0x%llx(0x%llx)\n",
> +               (long long)per_cpu_ptr(ptag->counters, 0),
> +               (long long)ptag->counters
> +               );
>
>
> And got the following (the arrow marks the offending entry):
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
> [Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0) <------
>
>
> I think we spotted two issues in this thread:
>
> 1. when a module load fails after the codetag section has been allocated, that memory leaks.
> 2. counters may need to be referenced even after the module is unloaded.
>
> #2 has already been addressed by your patch. I will send a simple patch to fix #1 (a rough model of the idea is sketched below).
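>
> For illustration only (building on the userspace sketch above; the actual
> patch against the module loader will of course look different), the shape
> of the fix for #1 is simply to drop the bookkeeping entry again as soon as
> the load fails, so nothing stale is left behind for a later unload to find:
>
> /* hypothetical helper continuing the model above, not kernel code */
> static void free_area_on_load_failure(void *mod)
> {
>     for (int i = 0; i < nr_entries; i++) {
>         if (registry[i].mod != mod)
>             continue;
>         /* drop the entry; the never-relocated tag can no longer be found */
>         registry[i] = registry[--nr_entries];
>         return;
>     }
> }
>
> /* In main() above, the failing first load would then become:
>  *     alloc_area(mod1);
>  *     free_area_on_load_failure(mod1);   // load failed, undo the allocation
>  *     free(mod1);
>  */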
>
> (Feel so relieved to finally draw a conclusion; hope there are no silly mistakes here :)
I see. So, layout_and_allocate() succeeds in allocating the codetag
memory but during a later failure we fail to free it. Makes sense and
your patch looks good to me. Thanks!
>
>
> Thanks
> David
>