Message-ID: <b04ae490-0878-4d39-a6c8-406d4dd3728f@redhat.com>
Date: Fri, 13 Jun 2025 11:13:54 -0400
From: Waiman Long <llong@...hat.com>
To: Thomas Gleixner <tglx@...utronix.de>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Anna-Maria Behnsen <anna-maria@...utronix.de>,
 Frederic Weisbecker <frederic@...nel.org>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 3/3] timers: Disable memory pre-allocation of timer
 debug objects


On 6/12/25 6:03 PM, Thomas Gleixner wrote:
> On Thu, Jun 05 2025 at 23:15, Waiman Long wrote:
>> A circular locking dependency lockdep splat was hit recently with a
>> debug kernel. The dependency chain (in reverse order) is:
>>
>>    -> #3 (&zone->lock){-.-.}-{2:2}:
>>    -> #2 (&base->lock){-.-.}-{2:2}:
>>    -> #1 (&console_sch_key){-.-.}-{2:2}:
>>    -> #0 (console_owner){..-.}-{0:0}:
>>
>> The last one (console_owner) is from calling printk() within the
>> rmqueue_bulk() call in mm/page_alloc.c. The "base->lock" is from
>> lock_timer_base(), and the first one (&zone->lock) is due to
>> add_timer_on() leading to debug_object_activate() doing an actual
>> memory allocation, which acquires the zone lock.
>>
>> The console_sch_key comes from an s390 console driver in drivers/s390/cio.
>> The console_sch_key -> timer dependency happens because the console
>> driver sets a timeout value while holding its lock. Apparently it is
>> pretty common for a console driver to use a timer for timeouts or other
>> timing purposes, so this may happen with other console drivers as well.
>>
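For illustration, the driver pattern being described is roughly the
following. This is a hypothetical sketch, not the actual s390 cio code;
the example_console structure and its fields are made up:

	struct example_console {
		spinlock_t		lock;	/* class as in console_sch_key */
		struct timer_list	timeout_timer;
	};

	/* Hypothetical driver code, for illustration only. */
	static void example_console_start_timeout(struct example_console *ec)
	{
		unsigned long flags;

		spin_lock_irqsave(&ec->lock, flags);
		/*
		 * mod_timer() takes the timer base lock internally, so
		 * this records an ec->lock -> base->lock dependency in
		 * lockdep.
		 */
		mod_timer(&ec->timeout_timer, jiffies + HZ);
		spin_unlock_irqrestore(&ec->lock, flags);
	}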
>> One way to break this circular locking dependency is to disallow any
>> memory allocation when a timer debug object is being handled. Do this by
>> setting the ODEBUG_FLAG_NO_ALLOC flag in the timer_debug_descr
>> structure.
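As a rough sketch, the proposal amounts to something like this. The
descriptor fields follow mainline kernel/time/timer.c; the .flags field
and ODEBUG_FLAG_NO_ALLOC are the new parts proposed by the patch, not
existing debugobjects API:

	static const struct debug_obj_descr timer_debug_descr = {
		.name			= "timer_list",
		.flags			= ODEBUG_FLAG_NO_ALLOC,	/* proposed: never allocate */
		.debug_hint		= timer_debug_hint,
		.is_static_object	= timer_is_static_object,
		.fixup_init		= timer_fixup_init,
		.fixup_activate		= timer_fixup_activate,
		.fixup_free		= timer_fixup_free,
		.fixup_assert_init	= timer_fixup_assert_init,
	};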
> Well. I'm absolutely not convinced that this is the right approach.
>
> I have a hard time finding the printk() in rmqueue_bulk(). But if there
> is one, then it has to go or has to be converted to a deferred printk(),
> simply because that code can be called from so many contexts, all of
> which can legitimately create a lock dependency chain into the console
> drivers in some way or another. That's like invoking printk() from the
> guts of the scheduler or locking code.

Actually, rmqueue_bulk() calls expand(), which in turn calls
__add_to_free_list(), and the printk() comes from the VM_WARN_ONCE()
macro there.
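The relevant helper looks roughly like this (paraphrased from
mm/page_alloc.c; the exact form varies by kernel version):

	static inline void __add_to_free_list(struct page *page, struct zone *zone,
					      unsigned int order, int migratetype,
					      bool tail)
	{
		struct free_area *area = &zone->free_area[order];

		/* This warning is the printk() that closes the lockdep cycle: */
		VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
			     "page type is %lu, passed migratetype is %d (nr=%d)\n",
			     get_pageblock_migratetype(page), migratetype, 1 << order);

		if (tail)
			list_add_tail(&page->buddy_list, &area->free_list[migratetype]);
		else
			list_add(&page->buddy_list, &area->free_list[migratetype]);
		area->nr_free++;
	}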

In a sense, printk() is called because of some other issues in the mm code.

>
>> The figures below show the number of times the debug_objects_fill_pool()
>> function reached the statements right before and after the no_alloc
>> check, during initial bootup and after running a parallel kernel build
>> on a 2-socket, 96-thread x86-64 system.
>>
>>                              Before      After   non-timer %
>>                              ------      -----   -----------
>>    Initial bootup           150,730    148,198      98.3%
>>    Parallel kernel build  5,974,464  5,893,116      98.6%
>>
>> So from object pre-allocation perspective, timer debug objects represent
>> just a small slice of the total number of debug objects to be processed.
> That math is skewed due to the way debugobjects handles the
> allocations for the global pool.
>
> The initial decision to attempt a refill is:
>
>      count < min_cnt
>
> where min_cnt = 256 + 16 * num_possible_cpus()
>
> That makes _one_ context go into the allocation path unless
>
>      count < min_cnt / 2
>
> which forces all contexts to try allocating in order not to deplete the
> pool.
>
> So let's assume we have 16 CPUs, then min_cnt = 512 and therefore
> min_cnt / 2 = 256.
>
> As the initial context, which allocates when the count goes below 512,
> might be preempted, the rest of the callers can easily lower the pool
> count to 256.
>
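In pseudo-C, that two-threshold heuristic is roughly the following.
The names are illustrative only; want_refill() and refiller_busy are
not the actual lib/debugobjects.c symbols:

	static atomic_t refiller_busy;	/* hypothetical single-refiller gate */

	static bool want_refill(unsigned int count)
	{
		unsigned int min_cnt = 256 + 16 * num_possible_cpus();

		if (count >= min_cnt)
			return false;	/* pool is full enough, no refill */
		if (count < min_cnt / 2)
			return true;	/* nearly depleted: everyone refills */
		/* Between min_cnt/2 and min_cnt, only one context refills. */
		return !atomic_cmpxchg(&refiller_busy, 0, 1);
	}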
> In the 0-day splat the debug_objects OOM happens from o2net_init():
>
> [ 92.566274][ T1] debug_object_init (kbuild/obj/consumer/x86_64-randconfig-003-20250608/lib/debugobjects.c:785)
> [ 92.566777][ T1] init_timer_key (kbuild/obj/consumer/x86_64-randconfig-003-20250608/arch/x86/include/asm/jump_label.h:36)
> [ 92.567230][ T1] o2net_init (kbuild/obj/consumer/x86_64-randconfig-003-20250608/fs/ocfs2/cluster/tcp.c:2128 (discriminator 3))
> [ 92.567629][ T1] init_o2nm (kbuild/obj/consumer/x86_64-randconfig-003-20250608/fs/ocfs2/cluster/nodemanager.c:832)
> [ 92.568023][ T1] do_one_initcall (kbuild/obj/consumer/x86_64-randconfig-003-20250608/init/main.c:1257)
>
> o2net_init() initializes 255 nodes, and each node has three delayed
> works. Each delayed work contains a timer for which debugobjects needs
> to create a new tracking object. So with your brute-force approach of
> blindly disabling allocations for timers, o2net_init() can trivially
> deplete the pool.
>
> For the o2net case this requires that workqueue debugobjects are
> disabled, but you can't dismiss that as a silly scenario, because there
> are other code paths which do bulk initialization of timers w/o having
> a work involved.
>
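For illustration (the structure below is a hypothetical stand-in, not
the actual struct o2net_node):

	/*
	 * Every delayed_work embeds a timer_list, so INIT_DELAYED_WORK()
	 * ends up in init_timer_key() and hence in debug_object_init()
	 * for the embedded timer. With 255 nodes and three delayed works
	 * each, that is ~765 tracking objects created back to back --
	 * enough to drain a pool that, with the proposed flag, timers
	 * themselves would no longer be allowed to refill.
	 */
	struct example_node {
		struct delayed_work	connect_work;
		struct delayed_work	connect_expired_work;
		struct delayed_work	still_up_work;
	};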
> So using the percentage of timer operations to evaluate whether this
> change can cause a debug object OOM is just bogus and wishful thinking.
>
> Let's take a step back and ask the obvious question: when does
> consumption of debug objects actually happen?
>
>    1) For all dynamically initialized objects it happens in
>       debug_object_init()
>
>    2) For statically initialized objects it happens in
>       debug_object_activate()
>
> #2 is arguably irrelevant as there are not gazillions of statically
>     initialized objects that are trackable by debugobjects.
>
> #1 is the vast majority and the good news is that the initialization of
>     such objects (after allocation) happens mostly in preemptible context
>     with no locks held.
>
> So the obvious thing to try is not adding some random flag to timers
> (and tomorrow to RCU and work), but to restrict the allocation
> requirement to debug_object_init().
>
> Something like the untested patch below should make all of these
> headaches go away. Except for the general observation that debugobjects
> is not the only way to create nasty dependency chains into console
> driver locks, but that's a headache the MM folks have to sort out.
>
> Thanks,
>
>          tglx
> ---
>
> --- a/lib/debugobjects.c
> +++ b/lib/debugobjects.c
> @@ -811,8 +811,6 @@ int debug_object_activate(void *addr, co
>   	if (!debug_objects_enabled)
>   		return 0;
>   
> -	debug_objects_fill_pool();
> -
>   	db = get_bucket((unsigned long) addr);
>   
>   	raw_spin_lock_irqsave(&db->lock, flags);
> @@ -1000,8 +998,6 @@ void debug_object_assert_init(void *addr
>   	if (!debug_objects_enabled)
>   		return;
>   
> -	debug_objects_fill_pool();
> -
>   	db = get_bucket((unsigned long) addr);
>   
>   	raw_spin_lock_irqsave(&db->lock, flags);
>
Thanks for the suggestion. I will take a further look at this problem.

Cheers,
Longman

