Message-ID: <20160627122553.GA2111@lerouge>
Date:	Mon, 27 Jun 2016 14:25:56 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	riel@...hat.com
Cc:	linux-kernel@...r.kernel.org, peterz@...radead.org,
	mingo@...nel.org, pbonzini@...hat.com, fweisbec@...hat.com,
	wanpeng.li@...mail.com, efault@....de, tglx@...utronix.de,
	rkrcmar@...hat.com
Subject: Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time

On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@...hat.com wrote:
> From: Rik van Riel <riel@...hat.com>
> 
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
> 
> This is inaccurate if only a subset of 'ticks' jiffies was
> actually spent handling irqs, and could conceivably mis-count
> all of the ticks during a period as irq time, when there was
> some irq and some softirq time.

Good catch!

Many comments follow.

> 
> This can actually happen when irqtime_account_process_tick
> is called from account_idle_ticks, which can pass a larger
> number of ticks down all at once.
> 
> Fix this by changing irqtime_account_hi_update and
> irqtime_account_si_update to round elapsed irq and softirq
> time to jiffies, and return the number of jiffies spent in
> each mode, similar to how steal time is handled.
> 
> Additionally, have irqtime_account_process_tick take into
> account how much time was spent in each of steal, irq,
> and softirq time.
> 
> The latter could help improve the accuracy of timekeeping

Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.

> when returning from idle on a NO_HZ_IDLE CPU.
> 
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
> 
> Signed-off-by: Rik van Riel <riel@...hat.com>
> ---
>  kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 60 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..15b813c014be 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
>  }
>  EXPORT_SYMBOL_GPL(irqtime_account_irq);
>  
> -static int irqtime_account_hi_update(void)
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long irq_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 irq;

Should be cputime_t ? And perhaps renamed to irq_cputime to distinguish
clearly against irq_jiffies?

>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_hardirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> -		ret = 1;
> +	irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];

cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
even cputime64_t). So you need to convert cpu_hardirq_time before doing the
subtraction.

> +	if (irq > cputime_one_jiffy) {
> +		irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
> +		cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return irq_jiffies;

I think we shouldn't use jiffies at all here and should rely only on cputime_t.
max_jiffies should be of cputime_t and this function should return a cputime_t.

The resulting code should be simpler and in fact more precise (more below on that).

>  }
>  
> -static int irqtime_account_si_update(void)
> +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
>  {
>  	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long si_jiffies = 0;
>  	unsigned long flags;
> -	u64 latest_ns;
> -	int ret = 0;
> +	u64 softirq;
>  
>  	local_irq_save(flags);
> -	latest_ns = this_cpu_read(cpu_softirq_time);
> -	if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> -		ret = 1;
> +	softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> +	if (softirq > cputime_one_jiffy) {
> +		si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> +		cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> +	}
>  	local_irq_restore(flags);
> -	return ret;
> +	return si_jiffies;

So same comments apply here.

[...]
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
>   */
> @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>  {
>  	cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
>  	u64 cputime = (__force u64) cputime_one_jiffy;
> -	u64 *cpustat = kcpustat_this_cpu->cpustat;
> +	unsigned long other;
>  
> -	if (steal_account_process_tick(ULONG_MAX))
> +	/*
> +	 * When returning from idle, many ticks can get accounted at
> +	 * once, including some ticks of steal, irq, and softirq time.
> +	 * Subtract those ticks from the amount of time accounted to
> +	 * idle, or potentially user or system time. Due to rounding,
> +	 * other time can exceed ticks occasionally.
> +	 */
> +	other = account_other_ticks(ticks);
> +	if (other >= ticks)
>  		return;
> +	ticks -= other;
>  
>  	cputime *= ticks;
>  	scaled *= ticks;

So instead of dealing with ticks here, I think you should rather use the above
cputime as both the limit and the remaining time to account after steal/irqs.

This should avoid some intermediate conversions and improve precision when cputime_t has nsecs granularity.

Say we account 2 ticks to idle (let's say HZ=100) and the irq time to account is 15 ms. 2 ticks = 20 ms,
so we have 5 ms left to account to idle. With the jiffies granularity in this patch, we would account
one tick (10 ms) to irqtime, leaving 5 ms to account back later, and one full tick to idle
time, whereas if you deal with cputime_t you are going to account the correct amount of idle time.

Beyond the scope of this patchset, we should probably kill cputime_t and account everything in nsecs.
I had a patchset that did that 2 years ago; I should probably respin it.

Thanks.
