Message-ID: <489a2474.19ea.196e2d20b87.Coremail.00107082@163.com>
Date: Sun, 18 May 2025 17:55:38 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Suren Baghdasaryan" <surenb@...gle.com>
Cc: kent.overstreet@...ux.dev, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: BUG: unable to handle page fault for address


>>>
>>> I do notice there are places where counters are referenced "after" free_module, but the logs I attached
>>> happened "during" free_module():
>>>
>>>  [Fri May 16 12:05:41 2025] BUG: unable to handle page fault for address: ffff9d28984c3000
>>>  [Fri May 16 12:05:41 2025] #PF: supervisor read access in kernel mode
>>> [Fri May 16 12:05:41 2025] #PF: error_code(0x0000) - not-present page
>>> ...
>>>  [Fri May 16 12:05:41 2025] RIP: 0010:release_module_tags+0x103/0x1b0
>>> ...
>>>  [Fri May 16 12:05:41 2025] Call Trace:
>>>  [Fri May 16 12:05:41 2025]  <TASK>
>>>  [Fri May 16 12:05:41 2025]  codetag_unload_module+0x135/0x160
>>> [Fri May 16 12:05:41 2025]  free_module+0x19/0x1a0
>>>
>>> The call chain is the same as you mentioned above. 
>>
>>Is this failure happening before or after my fix? With my fix, percpu
>>data should not be freed at all if tags are still used. Please
>>clarify.
>
>It is before your fix.  Your patch does fix the issue.
>  
>My reproduction procedure:
>1. enter recovery mode
>2. install nvidia driver 570.144, failed with Unknown symbol drm_client_setup
>3. modprobe drm_client_lib
>4. install nvidia driver 570.144
>5. install nvidia driver 550.144.03
>6. reboot and repeat from step 1
>
>The error happened in step 4, and the failure in step 2 is crucial: if I run 'modprobe drm_client_lib' at the beginning, no error is observed.
>
>There may be something off about how the kernel handles the data.percpu section.
>The good thing is that it can be reproduced, so I can add debug messages to confirm or rule out suspicions.
>Any suggestions?
>
>
>Thanks
>David
>
>
After poking around and logging memory addresses, I think I finally understand what is happening here.

1. codetag_alloc_module_section allocates memory when the module is loaded.
2. The module load fails because of an undefined symbol.
3. The codetag section memory is not freed.
4. A later module load succeeds, and the module's address happens to reuse the previously used address.
5. codetag_alloc_module_section allocates another section.
6. The percpu section is allocated, and the relocated addresses are written into the section allocated in step 5.
7. On module unload, the maple tree search finds the codetag memory from step 1,
which never had any relocated addresses populated.
8. A page fault occurs, because tag->counters is 0.
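
As a rough user-space model of steps 1-7 (not kernel code): the names tag_area, area_alloc and area_find are hypothetical, a flat array stands in for the maple tree, and entries are matched on the module address as described above. It only shows why a leftover entry from a failed load hides the properly relocated one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one codetag area tracked per module load. */
struct tag_area {
    uintptr_t mod_addr;   /* address the module was loaded at */
    void *counters;       /* NULL until relocations are applied */
    int in_use;
};

#define MAX_AREAS 8
static struct tag_area areas[MAX_AREAS];

/* Steps 1 and 5: reserve an area when a module starts loading. */
static struct tag_area *area_alloc(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_AREAS; i++) {
        if (!areas[i].in_use) {
            areas[i] = (struct tag_area){ mod_addr, NULL, 1 };
            return &areas[i];
        }
    }
    return NULL;
}

/* Step 7: on unload, return the first area recorded for this module address. */
static struct tag_area *area_find(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_AREAS; i++)
        if (areas[i].in_use && areas[i].mod_addr == mod_addr)
            return &areas[i];
    return NULL;
}

/* Replays steps 1-7; returns the counters pointer the unload path would see. */
static void *replay_failed_then_successful_load(void)
{
    static int percpu_storage;          /* stands in for relocated percpu data */
    uintptr_t addr = 0x1000;            /* arbitrary example address */

    struct tag_area *stale = area_alloc(addr);  /* steps 1-3: load fails, area kept */
    struct tag_area *good = area_alloc(addr);   /* steps 4-5: same address reused */
    assert(stale && good && stale != good);
    good->counters = &percpu_storage;           /* step 6: relocation applied */

    return area_find(addr)->counters;           /* step 7: the stale area is found */
}
```

In this model replay_failed_then_successful_load() returns NULL, which corresponds to the tag->counters == 0 case in step 8.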

I used the following change to log the addresses:

--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -575,6 +575,11 @@ static void release_module_tags(struct module *mod, bool used)
        if (!used)
                goto release_area;
 
+       struct alloc_tag *ptag = (struct alloc_tag *)(module_tags.start_addr + mas.index);
+       pr_info("percpu 0: 0x%llx(0x%llx)\n",
+                       (long long)per_cpu_ptr(ptag->counters, 0),
+                       (long long)ptag->counters
+                       );


And got the following (the offending entry is marked with an arrow):
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee41030(0xffffffffbc57e030)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee410e0(0xffffffffbc57e0e0)
[Sun May 18 16:25:47 2025] percpu 0: 0xffff8edb6ee40fa0(0xffffffffbc57dfa0)
[Sun May 18 16:26:43 2025] percpu 0: 0xffff8edbb28c3000(0x0)   <------
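
The marked line can be cross-checked with a bit of arithmetic. The sketch below assumes per_cpu_ptr(p, cpu) is simply p plus a per-CPU offset (true for the generic SMP implementation, but treat it as an assumption here), and uses only the values from the log above. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the log above: { per_cpu_ptr(counters, 0), counters }. */
static const uint64_t samples[4][2] = {
    { 0xffff8edb6ee41030ULL, 0xffffffffbc57e030ULL },
    { 0xffff8edb6ee410e0ULL, 0xffffffffbc57e0e0ULL },
    { 0xffff8edb6ee40fa0ULL, 0xffffffffbc57dfa0ULL },
    { 0xffff8edbb28c3000ULL, 0x0ULL },   /* the marked line */
};

/* Offset implied by one sample, assuming per_cpu_ptr(p, 0) == p + offset. */
static uint64_t implied_cpu0_offset(int i)
{
    assert(i >= 0 && i < 4);
    return samples[i][0] - samples[i][1];   /* unsigned, wraps modulo 2^64 */
}

/* Returns 1 when every sample implies the same CPU 0 offset. */
static int offsets_agree(void)
{
    for (int i = 1; i < 4; i++)
        if (implied_cpu0_offset(i) != implied_cpu0_offset(0))
            return 0;
    return 1;
}
```

All four lines imply the same CPU 0 offset, 0xffff8edbb28c3000, so the address in the marked line is just that offset added to a counters pointer of 0.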


I think we have spotted two issues in this thread:

1. When a module load fails after the codetag section has been allocated, that memory leaks.
2. Counters may still need to be referenced even after the module is unloaded.

#2 has already been addressed by your patch. I will send a simple patch to fix #1.
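
The general shape of a fix for #1 can be modelled in user space as below. This is only a sketch of the idea (release the area on the failed-load path), not the actual patch; tag_slot, slot_reserve, slot_release, slot_lookup and load_module_model are all hypothetical names, and a flat array again stands in for the maple tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one codetag area tracked per module load. */
struct tag_slot {
    uintptr_t mod_addr;   /* address the module was loaded at */
    void *counters;       /* NULL until relocations are applied */
    int in_use;
};

#define MAX_SLOTS 8
static struct tag_slot slots[MAX_SLOTS];

static struct tag_slot *slot_reserve(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_SLOTS; i++) {
        if (!slots[i].in_use) {
            slots[i] = (struct tag_slot){ mod_addr, NULL, 1 };
            return &slots[i];
        }
    }
    return NULL;
}

static void slot_release(struct tag_slot *s)
{
    s->in_use = 0;
    s->counters = NULL;
}

static struct tag_slot *slot_lookup(uintptr_t mod_addr)
{
    for (size_t i = 0; i < MAX_SLOTS; i++)
        if (slots[i].in_use && slots[i].mod_addr == mod_addr)
            return &slots[i];
    return NULL;
}

/* Model of a module load; symbols_ok == 0 mimics an undefined symbol. */
static int load_module_model(uintptr_t addr, void *counters, int symbols_ok)
{
    struct tag_slot *s = slot_reserve(addr);

    if (!s)
        return -1;
    if (!symbols_ok) {
        slot_release(s);   /* cleanup on failure: no stale slot is left behind */
        return -1;
    }
    s->counters = counters;   /* relocation applied on the successful path */
    return 0;
}

/* Failed load followed by a successful one at the same address. */
static void *counters_seen_on_unload(void)
{
    static int percpu_storage;   /* stands in for relocated percpu data */
    uintptr_t addr = 0x1000;     /* arbitrary example address */

    int first = load_module_model(addr, NULL, 0);
    int second = load_module_model(addr, &percpu_storage, 1);
    assert(first != 0 && second == 0);

    struct tag_slot *found = slot_lookup(addr);
    return found ? found->counters : NULL;
}
```

With the release on the failure path, the lookup at unload time finds the slot from the successful load, so counters_seen_on_unload() returns a non-NULL pointer in this model.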

(I feel so relieved to finally reach a conclusion; I hope there are no silly mistakes here :)


Thanks
David
