linux-kernel - Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241113221804.GC2331600@google.com>
Date: Wed, 13 Nov 2024 22:18:04 +0000
From: Joel Fernandes <joel@...lfernandes.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org,
	Anna-Maria Behnsen <anna-maria@...utronix.de>,
	Frederic Weisbecker <frederic@...nel.org>,
	Ingo Molnar <mingo@...nel.org>
Subject: Re: [RFC 3/3] tick-sched: Replace jiffie readout with idle_entrytime

On Tue, Nov 12, 2024 at 03:30:24PM +0100, Thomas Gleixner wrote:
> On Fri, Nov 08 2024 at 17:48, Joel Fernandes wrote:
> > This solves the issue where jiffies can be stale and inaccurate.
> 
> Which issue?

I will describe below.

> > Putting some prints, I see that basemono can be quite stale:
> > tick_nohz_next_event: basemono=18692000000 basemono_from_idle_entrytime=18695000000
> 
> What is your definition of stale? 3ms on a system with HZ < 1000 is
> completely correct and within the margin of the next tick, no?

I am on HZ=1000. So 3ms is not within margin of the next tick for me.

> > Since we have 'now' in ts->idle_entrytime, we can just use that. It is
> > more accurate, cleaner, reduces lines of code and reduces any lock
> > contention with the seq locks.
> 
> What's more accurate and what is the actual problem you are trying to
> solve. This handwaving about cleaner, less lines of code and contention
> on a non existing lock is just not helpful.

Oh sure. the concern I have is that jiffies might be quite out of date for
whatever reason as I shared above with comparing the jiffies with
idle_entrytrime. I am not sure exactly why this happens, maybe because
interrupts were disabled on the do-timer CPU? But in any case I was concerned
about this code in tick_nohz_next_event():

	if (rcu_needs_cpu() || arch_needs_cpu() ||
	    irq_work_needs_cpu() || local_timer_softirq_pending()) {
		next_tick = basemono + TICK_NSEC;
	}

If we are using a stale basemono, that just seems wrong to me. If basemono
is stale, then next_tick could even be in the past?

> > I was also concerned about issue where jiffies is not updated for a long
> > time, and then we receive a non-tick interrupt in the future. Relying on
> > stale jiffies value and using that as base can be inaccurate to determine
> > whether next event occurs within next tick. Fix that.
> 
> I'm failing to decode this word salad.

I think I came up with an incorrect explanation of why basemono is lagging -
I have to look deeper into when basemono can be out-of-date - I certainly saw
it being lagging sometimes as I showed in the traces.

Anyway my broken word salad was something like: Say jiffies update did not
happen for a long time (can the tick be turned off on a do-timer CPU?). Then
say if a non-tick interrupt, like a device interrupt bring the CPU out of
idle, and then on the way back to idle, it tries to stop the tick. 'basemono'
would be quite out of date.

I am just speculating, I don't know exactly why I see basemono lagging. I
thought maybe we can fix that and avoid issues in the future that might show
up because of such inaccuracy.

Other theories?

I wonder if you're going to mention that this code works even if basemono is
out-of-date. But I am concerned that in 'low resolution mode', if we have an
hrtimer that is about to expire within the next tick, and if basemono is
lagging by several ticks, then could this code accidentally not keep the tick
alive thinking that the hrtimer is several ticks away?

> > XXX: Need to fix issue in idle accounting which does 'jiffies -
> > idle_entrytime'. If idle_entrytime is more current than jiffies, it
> > could cause negative values. I could replace jiffies with idle_exittime
> > in this computation potentially to fix that.
> 
> So you "fix" some yet to be correctly described issue by breaking stuff?

Its a side-effect of this patch, more adjustments are needed. This is just an
RFC, Thomas :-)

> >  static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> >  {
> > -	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo;
> > +	u64 basemono, next_tick, delta, expires, delta_hr, next_hr_wo, boot_ticks;
> >  	unsigned long basejiff;
> >  	int tick_cpu;
> >  
> > -	basemono = get_jiffies_update(&basejiff);
> > +	boot_ticks = DIV_ROUND_DOWN_ULL(ts->idle_entrytime, TICK_NSEC);
> 
> Again this div/mult is more expensive than the sequence count on 32bit.

Got it, I was hoping that there was a better way to calculate the number of
ticks without expensive div/mult, let me know if you had any ideas on that?
But it seems to me there isn't an alternative. So we may have to table this
idea. Also I think it suffers from the same tick-skew issue you mentioned I
think.

> > -/*
> > - * Read jiffies and the time when jiffies were updated last
> > - */
> > -u64 get_jiffies_update(unsigned long *basej)
> 
> How does this even compile? This function is global for a reason.

Yeah, sorry I had fixed the issue in my next revision when I adjusted the
timer migration code:

https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=timers/tick-sched&id=1cf33bddf50341cf9802ed19374dc42d8466868b

thanks,

 - Joel