Message-ID: <8bca8139-8a6c-77b2-c295-9698d3662251@redhat.com>
Date:   Wed, 24 Nov 2021 10:07:57 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Lang Yu <lang.yu@....com>, linux-mm@...ck.org,
        Oscar Salvador <osalvador@...e.de>
Cc:     linux-kernel@...r.kernel.org,
        Catalin Marinas <catalin.marinas@....com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH v2] mm/kmemleak: Avoid scanning potential huge holes

On 08.11.21 15:00, Lang Yu wrote:
> When using devm_request_free_mem_region() and devm_memremap_pages()
> to add ZONE_DEVICE memory, if the requested free mem region's end pfn
> is huge (e.g., 0x400000000), node_end_pfn() will also be huge
> (see move_pfn_range_to_zone()). This creates a huge hole between
> node_start_pfn() and node_end_pfn().
> 
> We found that on some AMD APUs, amdkfd requested such a free mem
> region and created a huge hole. In that case, the following code
> snippet was just doing a busy test_bit() loop over the huge hole:
> 
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> 	struct page *page = pfn_to_online_page(pfn);
> 	if (!page)
> 		continue;
> 	...
> }
> 
> So we got a soft lockup:
> 
> watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
> CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
> RIP: 0010:pfn_to_online_page+0x5/0xd0
> Call Trace:
>   ? kmemleak_scan+0x16a/0x440
>   kmemleak_write+0x306/0x3a0
>   ? common_file_perm+0x72/0x170
>   full_proxy_write+0x5c/0x90
>   vfs_write+0xb9/0x260
>   ksys_write+0x67/0xe0
>   __x64_sys_write+0x1a/0x20
>   do_syscall_64+0x3b/0xc0
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> I did some tests with the patch.
> 
> (1) amdgpu module unloaded
> 
> before the patch:
> 
> real    0m0.976s
> user    0m0.000s
> sys     0m0.968s
> 
> after the patch:
> 
> real    0m0.981s
> user    0m0.000s
> sys     0m0.973s
> 
> (2) amdgpu module loaded
> 
> before the patch:
> 
> real    0m35.365s
> user    0m0.000s
> sys     0m35.354s
> 
> after the patch:
> 
> real    0m1.049s
> user    0m0.000s
> sys     0m1.042s
> 
> v2:
> - Only scan pages belonging to the zone. (David Hildenbrand)
> - Use __maybe_unused to make compilers happy.
> 
> Signed-off-by: Lang Yu <lang.yu@....com>
> ---
>  mm/kmemleak.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> index b57383c17cf6..adbe5aa01184 100644
> --- a/mm/kmemleak.c
> +++ b/mm/kmemleak.c
> @@ -1403,7 +1403,8 @@ static void kmemleak_scan(void)
>  {
>  	unsigned long flags;
>  	struct kmemleak_object *object;
> -	int i;
> +	struct zone *zone;
> +	int __maybe_unused i;
>  	int new_leaks = 0;
>  
>  	jiffies_last_scan = jiffies;
> @@ -1443,9 +1444,9 @@ static void kmemleak_scan(void)
>  	 * Struct page scanning for each node.
>  	 */
>  	get_online_mems();
> -	for_each_online_node(i) {
> -		unsigned long start_pfn = node_start_pfn(i);
> -		unsigned long end_pfn = node_end_pfn(i);
> +	for_each_populated_zone(zone) {
> +		unsigned long start_pfn = zone->zone_start_pfn;
> +		unsigned long end_pfn = zone_end_pfn(zone);
>  		unsigned long pfn;
>  
>  		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> @@ -1454,8 +1455,8 @@ static void kmemleak_scan(void)
>  			if (!page)
>  				continue;
>  
> -			/* only scan pages belonging to this node */
> -			if (page_to_nid(page) != i)
> +			/* only scan pages belonging to this zone */
> +			if (page_zone(page) != zone)
>  				continue;
>  			/* only scan if page is in use */
>  			if (page_count(page) == 0)
> 

I think in theory we could optimize further; there really isn't that
much need to skip single pages ... we can usually skip whole
pageblocks (in some corner cases we might have to back off one
pageblock and continue the search page-wise). But that's a different
story, and there might be no need to optimize.
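
Completely untested, but the skipping might look something like the
following. This is just an illustrative sketch under the assumption
that we can blindly jump to the next pageblock boundary whenever we
hit a hole -- which is exactly where the back-off corner case above
would need extra care:

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		if (!page) {
			/*
			 * No online page here: skip ahead to the next
			 * pageblock boundary instead of probing every
			 * single pfn in the hole. -1 because the loop
			 * increment adds one back.
			 */
			pfn = ALIGN(pfn + 1, pageblock_nr_pages) - 1;
			continue;
		}
		...
	}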


Also, I wonder if we should adjust the cond_resched() logic instead:
right now the "continue" statements skip the cond_resched() at the end
of the loop body, so we never reschedule while iterating over a hole.
While your patch makes the "sparse node" case faster, I think we could
still run into the same issue in the "sparse zone" case now.

Acked-by: David Hildenbrand <david@...hat.com>

to this patch.


diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index b57383c17cf6..1cd1df3cb01b 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1451,6 +1451,9 @@ static void kmemleak_scan(void)
                for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                        struct page *page = pfn_to_online_page(pfn);
 
+                       if (!(pfn & 63))
+                               cond_resched();
+
                        if (!page)
                                continue;
 
@@ -1461,8 +1464,6 @@ static void kmemleak_scan(void)
                        if (page_count(page) == 0)
                                continue;
                        scan_block(page, page + 1, NULL);
-                       if (!(pfn & 63))
-                               cond_resched();
                }
        }
        put_online_mems();
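
That way we'd call cond_resched() even while we're only stepping over
holes, instead of only after actually scanning an online page.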


What do you think?

-- 
Thanks,

David / dhildenb
