lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170413133255.GA2608@lerouge>
Date:   Thu, 13 Apr 2017 15:32:57 +0200
From:   Frederic Weisbecker <fweisbec@...il.com>
To:     Wanpeng Li <kernellwp@...il.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Mike Galbraith <efault@....de>, Rik van Riel <riel@...hat.com>,
        Luiz Capitulino <lcapitulino@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: [BUG nohz]: wrong user and system time accounting

On Thu, Apr 13, 2017 at 12:31:12PM +0800, Wanpeng Li wrote:
> 2017-04-12 22:57 GMT+08:00 Thomas Gleixner <tglx@...utronix.de>:
> > On Wed, 12 Apr 2017, Frederic Weisbecker wrote:
> >> On Tue, Apr 11, 2017 at 04:22:48PM +0200, Thomas Gleixner wrote:
> >> > It's not different from the current jiffies based stuff at all. Same
> >> > failure mode.
> >>
> >> Yes you're right, I got confused again. So to fix this we could do our snapshots
> >> at a frequency lower than HZ but still high enough to avoid overhead.
> >>
> >> Something like TICK_NSEC / 2 ?
> >
> > If you are using TSC anyway then you can do proper accumulation for both
> > system and user and only account the data when the accumulation is more
> > than a jiffie.
> 
> So I implement it as below:
> 
> - HZ=1000.
>   1) two cpu hogs on cpu in nohz_full mode, 100% user time
>   2) Luzi's testcase, ~95% user time, ~5% idle time (as we expected)
> - HZ=250
>    1) two cpu hogs on cpu in nohz_full mode, 100% user time
>    2) Luzi's testcase, 100% idle
> 
> So the codes below still not work correctly for HZ=250, any suggestions?

Right, so first lets reorder that code a bit so we can see clear inside :-)

> 
> -------------------------------------->8-----------------------------------------------------
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d67eee8..6a11771 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -668,6 +668,8 @@ struct task_struct {
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>      seqcount_t            vtime_seqcount;
>      unsigned long long        vtime_snap;
> +    u64                vtime_acct_utime;
> +    u64                vtime_acct_stime;

You need to accumulate guest and steal time as well.

>      enum {
>          /* Task is sleeping or running in a CPU with VTIME inactive: */
>          VTIME_INACTIVE = 0,
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index f3778e2b..f8e54ba 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -674,20 +674,41 @@ void thread_group_cputime_adjusted(struct
> task_struct *p, u64 *ut, u64 *st)
>  #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
> 
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -static u64 vtime_delta(struct task_struct *tsk)
> +static u64 vtime_delta(struct task_struct *tsk, bool user)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> +    u64 delta, ret = 0;
> 
> -    if (time_before(now, (unsigned long)tsk->vtime_snap))
> -        return 0;
> +    delta = sched_clock() - tsk->vtime_snap;
> 
> -    return jiffies_to_nsecs(now - tsk->vtime_snap);
> +    if (is_idle_task(tsk)) {
> +        if (delta >= TICK_NSEC)
> +            ret = delta;
> +    } else {
> +        if (user) {
> +            tsk->vtime_acct_utime += delta;
> +            if (tsk->vtime_acct_utime >= TICK_NSEC)
> +                ret = tsk->vtime_acct_utime;
> +        } else {
> +            tsk->vtime_acct_stime += delta;
> +            if (tsk->vtime_acct_utime >= TICK_NSEC)
> +                ret = tsk->vtime_acct_stime;
> +        }

We already have vtime_account_idle, vtime_account_user, etc...
The accumulation should be done by these functions that know what and where
to account. vtime_delta() should really just return the difference against vtime_snap,
it's too low level to care about these details.

> +    }
> +
> +    return ret;
>  }
> 
> -static u64 get_vtime_delta(struct task_struct *tsk)
> +static u64 get_vtime_delta(struct task_struct *tsk, bool user)
>  {
> -    unsigned long now = READ_ONCE(jiffies);
> -    u64 delta, other;
> +    u64 delta = vtime_delta(tsk, user);
> +    u64 other;
> +
> +    if (!is_idle_task(tsk)) {
> +        if (user)
> +            tsk->vtime_acct_utime = 0;
> +        else
> +            tsk->vtime_acct_stime = 0;
> +    }

Like vtime_delta(), get_vtime_delta() shouldn't touch these accumulators.
Reset and accounting really should be done by the upper level functions
vtime_account_*()

Thanks.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ