Message-ID: <CAJuCfpHprFd2i92QfM+bDQE06eS79Q=CKQJ7GH-Vs3eBBi-yVg@mail.gmail.com>
Date: Mon, 19 May 2025 09:00:46 -0700
From: Suren Baghdasaryan <surenb@...gle.com>
To: David Wang <00107082@....com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address
On Sun, May 18, 2025 at 2:55 AM David Wang <00107082@....com> wrote:
>
>
> >>>
> >>> I do notice there are places where counters are referenced "after" free_module(), but the crashes
> >>> in the logs I attached happened "during" free_module():
> >>>
> >>> [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
> >>> [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
> >>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
> >>> ...
> >>> [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
> >>> ...
> >>> [Fri May 16 12:05:41 2025] Call Trace:
> >>> [Fri May 16 12:05:41 2025] <TASK>
> >>> [Fri May 16 12:05:41 2025] codetag_unload_module+0x135/0x160
> >>> [Fri May 16 12:05:41 2025] free_module+0x19/0x1a0
> >>>
> >>> The call chain is the same as you mentioned above.
> >>
> >>Is this failure happening before or after my fix? With my fix, percpu
> >>data should not be freed at all if tags are still used. Please
> >>clarify.
> >
> >It is before your fix. Your patch does fix the issue.
> >
> >My reproduction procedure is:
> >1. enter recovery mode
> >2. install nvidia driver 570.144; it fails with "Unknown symbol drm_client_setup"
> >3. modprobe drm_client_lib
> >4. install nvidia driver 570.144
> >5. install nvidia driver 550.144.03
> >6. reboot and repeat from step 1
> >
> >The error happened in step 4, and the failure in step 2 is crucial: if I modprobe drm_client_lib at the very beginning, no error can be observed.
> >
> >There may be something off about how the kernel handles the module's .data..percpu section.
> >The good thing is that it can be reproduced, so I can add debug messages to confirm or rule out suspicions.
> >Any suggestions?
> >
> >
> >Thanks
> >David
> >
> >
> After poking around and logging memory addresses, I think I finally understand what is happening here.
>
> 1. codetag_alloc_module_section() allocates memory while a module is being loaded
> 2. the module load fails due to an undefined symbol
> 3. the codetag section memory is not freed
> 4. a module is loaded again, and its address happens to reuse the previously used address
> 5. another codetag_alloc_module_section() call is made for it
> 6. the percpu section is allocated, and relocation then writes the per-CPU addresses into the memory from step 5
> 7. on module unload, the search through the maple tree finds the codetag memory from step 1,
> which has no relocated addresses populated at all
> 8. page fault, because tag->counters is 0 (a sketch of this pattern follows below)
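>
> To make the pattern concrete, here is a minimal userspace sketch of the
> sequence above. All names (fake_tag, area_entry, alloc_area, unload) are
> made up for illustration; it stands in for the maple-tree bookkeeping and
> skips every other detail: a failed load leaves an entry behind whose
> counters were never relocated, and a later unload that looks entries up by
> a reused module address can find that stale entry first.
>
> #include <stdio.h>
> #include <stdlib.h>
>
> struct fake_tag {
>     long *counters;             /* stays NULL if relocation never ran */
> };
>
> struct area_entry {
>     void *mod;                  /* "module" address, the lookup key */
>     struct fake_tag tag;        /* codetag memory reserved for that module */
> };
>
> #define MAX_ENTRIES 8
> static struct area_entry registry[MAX_ENTRIES];
> static int nr_entries;
>
> /* steps 1 and 5: reserve codetag memory for a loading module */
> static struct fake_tag *alloc_area(void *mod)
> {
>     registry[nr_entries].mod = mod;
>     return &registry[nr_entries++].tag;
> }
>
> /* steps 7 and 8: unload looks the module up and reads its counters */
> static void unload(void *mod)
> {
>     for (int i = 0; i < nr_entries; i++) {
>         if (registry[i].mod != mod)
>             continue;
>         if (!registry[i].tag.counters)
>             printf("entry %d is stale, counters == NULL -> the kernel would fault here\n", i);
>         else
>             printf("entry %d: counter %ld\n", i, *registry[i].tag.counters);
>         return;                 /* first match wins, like the stale range in step 7 */
>     }
> }
>
> int main(void)
> {
>     void *mod1 = malloc(64);
>     alloc_area(mod1);                           /* step 1: memory reserved...         */
>     free(mod1);                                 /* steps 2-3: load fails, entry leaks */
>
>     void *mod2 = malloc(64);                    /* step 4: may reuse mod1's address   */
>     struct fake_tag *t = alloc_area(mod2);      /* step 5 */
>     t->counters = calloc(1, sizeof(long));      /* step 6: "relocation" fills it in   */
>
>     unload(mod2);       /* if mod2 == mod1, the stale entry from step 1 is found first */
>     free(t->counters);
>     free(mod2);
>     return 0;
> }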
>
> To confirm this, I used the following change to log the addresses (the offending one shows up in the output further below):
>
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
>         if (!used)
>                 goto release_area;
>
> +       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
> +       pr_info("percpu 0: 0x%llx(0x%llx)\n",
> +               (long long)per_cpu_ptr(ptag->counters, 0),
> +               (long long)ptag->counters
> +               );
>
>
> And got the following (the arrow marks the offending entry):
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
> [Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
> [Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0) <------
>
>
> I think we spotted two issues in this thread:
>
> 1. when a module load fails after the codetag section has been allocated, that memory leaks.
> 2. counters may need to be referenced even after the module is unloaded.
>
> #2 has already been addressed by your patch. I will send a simple patch to fix #1 (a rough model of the idea is sketched below).
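>
> For illustration only (building on the userspace sketch above; the actual
> patch against the module loader will of course look different), the shape
> of the fix for #1 is simply to drop the bookkeeping entry again as soon as
> the load fails, so nothing stale is left behind for a later unload to find:
>
> /* hypothetical helper continuing the model above, not kernel code */
> static void free_area_on_load_failure(void *mod)
> {
>     for (int i = 0; i < nr_entries; i++) {
>         if (registry[i].mod != mod)
>             continue;
>         /* drop the entry; the never-relocated tag can no longer be found */
>         registry[i] = registry[--nr_entries];
>         return;
>     }
> }
>
> /* In main() above, the failing first load would then become:
>  *     alloc_area(mod1);
>  *     free_area_on_load_failure(mod1);   // load failed, undo the allocation
>  *     free(mod1);
>  */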
>
> (Feel so relieved to finally draw a conclusion; hope there are no silly mistakes here :)
I see. So, layout_and_allocate() succeeds in allocating the codetag
memory but during a later failure we fail to free it. Makes sense and
your patch looks good to me. Thanks!
>
>
> Thanks
> David
>