lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170407125801.503ae098@redhat.com>
Date:   Fri, 7 Apr 2017 12:58:01 -0400
From:   Luiz Capitulino <lcapitulino@...hat.com>
To:     Wanpeng Li <kernellwp@...il.com>
Cc:     linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Wanpeng Li <wanpeng.li@...mail.com>,
        Frederic Weisbecker <fweisbec@...il.com>,
        Rik van Riel <riel@...hat.com>,
        Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH] tick/nohz: Fix wrong user and system time accouting
 against vtime sampling

On Wed,  5 Apr 2017 23:36:24 -0700
Wanpeng Li <kernellwp@...il.com> wrote:

> From: Wanpeng Li <wanpeng.li@...mail.com>
> 
> This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
> that hog the CPU 100% of the time to that CPU, top reports 70% system time 
> and 30% user time utilization sometimes. The trace like this:
> 
>    hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- apic_timer_interrupt
> <idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- enter_from_user_mode
> <idle>-0     [000]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 (now=4295008339 vtime_snap=4295008339)
>    hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    irq_enter <-- smp_apic_timer_interrupt
>    hog-11980 [015]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_sched_timer <-- __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
> <idle>-0     [000]   341.494494: function:    do_timer <-- tick_do_update_jiffies64.part.14
>    hog-11980 [015]   341.494494: function:    irq_exit <-- smp_apic_timer_interrupt
> <idle>-0     [000]   341.494494: bprint:      do_timer: updated jiffies_64=4295008340 ticks=1
>    hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- prepare_exit_to_usermode
>    hog-11980 [015]   341.494494: function:    vtime_user_enter <-- __context_tracking_enter
>    hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 (now=4295008340 vtime_snap=4295008339)
>    hog-11980 [015]   341.494495: function:    __vtime_account_system <-- vtime_user_enter
>    hog-11980 [015]   341.494495: bprint:      get_vtime_delta: vtime_snap=4295008339 now=4295008340
>    hog-11980 [015]   341.494495: function:    account_system_time <-- __vtime_account_system
>    hog-11980 [015]   341.494495: bprint:      account_system_time: cputime=995488
> <idle>-0     [000]   341.494497: function:    irq_exit <-- smp_apic_timer_interrupt
> 
> In this trace, we see the following:
> 
>  1. On CPU15, we transition from user-space to kernel-space because
>     of a timer interrupt (it's the tick)
> 
>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>     last accounting
> 
>  3. While CPU15 is executing in kernel-space, jiffies is updated
>     by CPU0
> 
>  4. When going back to user-space, vtime_delta() returns non-zero
>     and the whole time is accounted for system time (observe how
>     the cputime parameter in account_system_time() is less than 1ms)
> 
> 
> After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
> to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
> jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
> at the same time:
> 
> 
>                CPU 0                  CPU 1
>                -----                  -----
>                                       exit_user() // no cputime update 
> tick X         update_jiffies
>                                       enter_user() // cputime update
> 
> 
>                                       exit_user() // no cputime update
> tick X+1       update_jiffies
>                                       enter_user() // cputime update
> 
> The nohz_full CPU receives an interrupt at the same time the timer 
> interrupt fires on the housekeeping CPU.
> 
> This patch offsets the tick to avert all ticks alignment in order 
> that the vtime sampling does not end up "in phase" with the jiffies 
> incrementing.
> 
> Reported-by: Luiz Capitulino <lcapitulino@...hat.com>
> Suggested-by: Rik van Riel <riel@...hat.com>
> Cc: Frederic Weisbecker <fweisbec@...il.com>
> Cc: Rik van Riel <riel@...hat.com>
> Cc: Mike Galbraith <efault@....de>
> Cc: Luiz Capitulino <lcapitulino@...hat.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Signed-off-by: Wanpeng Li <wanpeng.li@...mail.com>

Reviewed-and-Tested-by: Luiz Capitulino <lcapitulino@...hat.com>

Thanks for everyone who participated on tracking this down
and fixing it!

> ---
>  kernel/time/tick-sched.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..35d79f7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
>  	/* Get the next period (per-CPU) */
>  	hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>  
> -	/* Offset the tick to avert jiffies_lock contention. */
> -	if (sched_skew_tick) {
> +	/*
> +	 * Offset the tick to avert jiffies_lock contention, and all ticks
> +	 * alignment in order that the vtime sampling does not end up "in
> +	 * phase" with the jiffies incrementing.
> +	 */
> +	if (sched_skew_tick || tick_nohz_full_enabled()) {
>  		u64 offset = ktime_to_ns(tick_period) >> 1;
>  		do_div(offset, num_possible_cpus());
>  		offset *= smp_processor_id();

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ