linux-kernel - Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies update

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aQDSMxKDr85kPCJJ@swahl-home.5wahls.com>
Date: Tue, 28 Oct 2025 09:24:51 -0500
From: Steve Wahl <steve.wahl@....com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Steve Wahl <steve.wahl@....com>, Russ Anderson <rja@....com>,
        Dimitri Sivanich <sivanich@....com>, Kyle Meyer <kyle.meyer@....com>,
        Anna-Maria Behnsen <anna-maria@...utronix.de>,
        Frederic Weisbecker <frederic@...nel.org>,
        Ingo Molnar <mingo@...nel.org>, Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] tick/sched: Limit non-timekeeper CPUs calling jiffies
 update

On Tue, Oct 28, 2025 at 11:39:30AM +0530, Shrikanth Hegde wrote:
> 
> 
> On 10/28/25 12:04 AM, Steve Wahl wrote:
> > On large NUMA systems, while running a test program that saturates the
> > inter-processor and inter-NUMA links, acquiring the jiffies_lock can
> > be very expensive.  If the cpu designated to do jiffies updates
> > (tick_do_timer_cpu) gets delayed and other cpus decide to do the
> > jiffies update themselves, a large number of them decide to do so at
> > the same time.  The inexpensive check against tick_next_period is far
> > quicker than actually acquiring the lock, so most of these get in line
> > to obtain the lock.  If obtaining the lock is slow enough, this
> > spirals into the vast majority of CPUs continuously being stuck
> > waiting for this lock, just to obtain it and find out that time has
> > already been updated by another cpu. For example, on one random entry
> > to kdb by manually-injected NMI, I saw 2912 of 3840 cpus stuck here.
> > 
> > To avoid this, allow only one non-timekeeper CPU to call
> > tick_do_update_jiffies64() at any given time, resetting ts->stalled
> > jiffies only if the jiffies update function is actually called.
> > 
> > With this change, manually interrupting the test I find at most two
> > CPUs in the tick_do_update_jiffies64 function (the timekeeper and one
> > other).
> > 
> > Signed-off-by: Steve Wahl <steve.wahl@....com>
> > ---
> > 
> > v2: Rewritten to use an atomic to gate non-timekeeping cpus calling the
> >      jiffies update, as suggested by tglx. Title of patch has changed
> >      since trylock is no longer used.
> > 
> > v1 discussion:
> > https://lore.kernel.org/all/20251013150959.298288-1-steve.wahl@hpe.com/
> > 
> >   kernel/time/tick-sched.c | 30 ++++++++++++++++++++++++++----
> >   1 file changed, 26 insertions(+), 4 deletions(-)
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index c527b421c865..3ff3eb1f90d0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -201,6 +201,27 @@ static inline void tick_sched_flag_clear(struct tick_sched *ts,
> >   	ts->flags &= ~flag;
> >   }
> > +/*
> > + * Allow only one non-timekeeper CPU at a time update jiffies from
> > + * the timer tick.
> > + *
> > + * Returns true if update was run.
> > + */
> > +static bool tick_limited_update_jiffies64(struct tick_sched *ts, ktime_t now)
> > +{
> > +	static atomic_t in_progress;
> > +	int inp;
> > +
> > +	inp = atomic_read(&in_progress);
> > +	if (inp || !atomic_try_cmpxchg(&in_progress, &inp, 1))
> > +		return false;
> > +
> 
> You come here if (ts->last_tick_jiffies == jiffies). So it may be not necessary to check again.

TGLX had this in his rewrite suggestion, and I looked pretty intensely
at this test.

The situation I'm looking to resolve is caused by inter-NUMA links
being abnormally swamped with traffic.  Especially for writes, access
to shared memory locations, such as the atomic operations to
in_progress right above this, take longer than one usually would
expect.  So to me it makes sense that things may have changed since
the atomic_try_cmpxchg was initiated, and so I left the check in
place.

> > +	if (ts->last_tick_jiffies == jiffies)
> > +		tick_do_update_jiffies64(now);
> > +	atomic_set(&in_progress, 0);
> > +	return true;
> > +}
> > +
> >   #define MAX_STALLED_JIFFIES 5
> >   static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> > @@ -239,10 +260,11 @@ static void tick_sched_do_timer(struct tick_sched *ts, ktime_t now)
> >   		ts->stalled_jiffies = 0;
> >   		ts->last_tick_jiffies = READ_ONCE(jiffies);
> >   	} else {
> > -		if (++ts->stalled_jiffies == MAX_STALLED_JIFFIES) {
> > -			tick_do_update_jiffies64(now);
> > -			ts->stalled_jiffies = 0;
> > -			ts->last_tick_jiffies = READ_ONCE(jiffies);
> > +		if (++ts->stalled_jiffies >= MAX_STALLED_JIFFIES) {
> > +			if (tick_limited_update_jiffies64(ts, now)) {
> > +				ts->stalled_jiffies = 0;
> > +				ts->last_tick_jiffies = READ_ONCE(jiffies);
> > +			}
> >   		}
> >   	}
> 
> 
> Yes. This could help large systems.
> 
> Acked-by: Shrikanth Hegde <sshegde@...ux.ibm.com>

Thanks for your time reviewing!

--> Steve Wahl

-- 
Steve Wahl, Hewlett Packard Enterprise