Message-ID: <CAJuCfpGFDEMDhy-YQQMmrCeVtqH1hj3J1DZ497K3Ttw0U7osJw@mail.gmail.com>
Date:   Fri, 29 Nov 2019 17:41:55 -0800
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Jingfeng Xie <xiejingfeng@...ux.alibaba.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Xunlei Pang <xlpang@...ux.alibaba.com>,
        齐江(窅默) <qijiang.qj@...baba-inc.com>
Subject: Re: [PATCH] psi:fix divide by zero in psi_update_stats

On Thu, Nov 28, 2019 at 10:37 PM Jingfeng Xie
<xiejingfeng@...ux.alibaba.com> wrote:
>
> Weiner,
> The crash does not happen right after boot; in my case, it happens in the 58914 ~ 815463 second range after boot.
>
> Some values extracted from my coredump are below:
>
> period = 001df2dc00000000
> now = 001df2dc00000000, same as period
> expires = group->next_update = rdi = 00003594f700648e
> group->avg_last_update  could not be known
> missed_periods = 0
>
Considering that "period = now - (group->avg_last_update +
(missed_periods * psi_period))" and the above values (period == now and
missed_periods == 0), group->avg_last_update must be 0, which would
mean this is indeed the first update_averages() call.
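
As a quick check of that deduction, here is a standalone userspace sketch
using the dumped values (plain C stand-ins for the kernel u64 types, not
kernel code):

	#include <stdint.h>
	#include <assert.h>

	int main(void)
	{
		/* Values recovered from the coredump; missed_periods was 0. */
		uint64_t now    = 0x001df2dc00000000ULL;
		uint64_t period = 0x001df2dc00000000ULL;

		/*
		 * period = now - (avg_last_update + missed_periods * psi_period)
		 * with missed_periods == 0 rearranges to:
		 */
		uint64_t avg_last_update = now - period;
		assert(avg_last_update == 0);	/* the very first update_averages() */
		return 0;
	}
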
I think this can happen if a cgroup is created long after boot.
The following call chain would occur:
cgroup_create->psi_cgroup_alloc->group_init->INIT_DELAYED_WORK->psi_avgs_work->update_averages.
If the cgroup creation is timed so that psi_avgs_work is called when
sched_clock returns a value whose lower 32 bits are 0, then we get this problem.
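
To see why zero lower bits are fatal: the kernel's div_u64() takes a u32
divisor, so a u64 period is silently truncated at the call boundary. A
minimal userspace model (the two helpers are simplified stand-ins, not the
real kernel implementations):

	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in: like the kernel's div_u64(), the divisor is u32. */
	static uint64_t div_u64(uint64_t dividend, uint32_t divisor)
	{
		return dividend / divisor;
	}

	/* Stand-in for div64_u64(): full 64-bit divisor. */
	static uint64_t div64_u64(uint64_t dividend, uint64_t divisor)
	{
		return dividend / divisor;
	}

	int main(void)
	{
		uint64_t period = 0x001df2dc00000000ULL; /* from the coredump */
		uint64_t big    = 0x0000000100000001ULL; /* hypothetical divisor */

		printf("low 32 bits of period: %u\n", (uint32_t)period); /* 0 */

		/* Implicit u64 -> u32 truncation: this divides by 1, not by big. */
		printf("div_u64 truncates: %llu\n",
		       (unsigned long long)div_u64(big, big));

		/* div_u64(x, period) would divide by (u32)period == 0 and crash;
		 * div64_u64() sees the whole divisor and is safe: */
		printf("div64_u64: %llu\n",
		       (unsigned long long)div64_u64(period, period));
		return 0;
	}
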
The patch Johannes posted earlier, which sets group->avg_last_update to
sched_clock() in group_init(), should fix this problem. Tim, did you
capture this coredump after applying that patch? If not, please try
applying it and see if the crash still happens.


> On 2019/11/13 at 12:08 AM, "Johannes Weiner" <hannes@...xchg.org> wrote:
>
>     On Tue, Nov 12, 2019 at 10:48:46AM -0500, Johannes Weiner wrote:
>     > On Tue, Nov 12, 2019 at 10:41:46AM -0500, Johannes Weiner wrote:
>     > > On Fri, Nov 08, 2019 at 03:33:24PM +0800, tim wrote:
>     > > > In psi_update_stats, it is possible that period has a value like
>     > > > 0xXXXXXXXX00000000, where the lower 32 bits are 0. It is then passed to
>     > > > div_u64(), which truncates the u64 period to u32, resulting in a zero divisor.
>     > > > Use div64_u64() instead of div_u64() when the divisor is u64 to avoid
>     > > > truncation to 32 bits on 64-bit platforms.
>     > > >
>     > > > Signed-off-by: xiejingfeng <xiejingfeng@...ux.alibaba.com>
>     > >
>     > > This is legit. When we stop the periodic averaging worker due to an
>     > > idle CPU, the period after restart can be much longer than the ~4 sec
>     > > that fit in the lower 32 bits. See the missed_periods logic in update_averages().
>     >
>     > Argh, that's not right. Of course I notice right after hitting send.
>     >
>     > missed_periods are subtracted out of the difference between now and
>     > the last update, so period should not be much bigger than 2s.
>     >
>     > Something else is going on here.
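
For reference, the period bookkeeping in update_averages() looks roughly
like this (a paraphrased sketch of kernel/sched/psi.c, trimmed; the two
assignments at the end appear as context in the diff further down):

	/* Sketch, paraphrased from update_averages(), not verbatim: */
	u64 expires, period, avg_next_update;
	u64 missed_periods = 0;

	expires = group->avg_next_update;
	if (now - expires >= psi_period)
		missed_periods = div_u64(now - expires, psi_period);

	/*
	 * Whole missed periods are folded back into the baseline, so
	 * period normally stays close to psi_period (~2s):
	 */
	avg_next_update = expires + ((1 + missed_periods) * psi_period);
	period = now - (group->avg_last_update + (missed_periods * psi_period));
	group->avg_last_update = now;
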
>
>     Tim, does this happen right after boot? I wonder if it's because we're
>     not initializing avg_last_update, and the initial delta between the
>     last update (0) and the first scheduled update (sched_clock() + 2s)
>     ends up bigger than 4 seconds somehow. Later on, the delta between the
>     last and the scheduled update should always be ~2s. But for that to
>     happen, it would require a pretty slow boot, or a sched_clock() that
>     does not start at 0.
>
>     Tim, if you have a coredump, can you extract the value of the other
>     variables printed in the following patch?
>
>     diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>     index 84af7aa158bf..1b6836d23091 100644
>     --- a/kernel/sched/psi.c
>     +++ b/kernel/sched/psi.c
>     @@ -374,6 +374,10 @@ static u64 update_averages(struct psi_group *group, u64 now)
>          */
>         avg_next_update = expires + ((1 + missed_periods) * psi_period);
>         period = now - (group->avg_last_update + (missed_periods * psi_period));
>     +
>     +   WARN(period >> 32, "period=%llu now=%llu expires=%llu last=%llu missed=%llu\n",
>     +        period, now, expires, group->avg_last_update, missed_periods);
>     +
>         group->avg_last_update = now;
>
>         for (s = 0; s < NR_PSI_STATES - 1; s++) {
>
>     And we may need something like this to make the tick initialization
>     more robust regardless of the reported bug here:
>
>     diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>     index 84af7aa158bf..ce8f6748678a 100644
>     --- a/kernel/sched/psi.c
>     +++ b/kernel/sched/psi.c
>     @@ -185,7 +185,8 @@ static void group_init(struct psi_group *group)
>
>         for_each_possible_cpu(cpu)
>                 seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
>     -   group->avg_next_update = sched_clock() + psi_period;
>     +   group->avg_last_update = sched_clock();
>     +   group->avg_next_update = group->avg_last_update + psi_period;
>         INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
>         mutex_init(&group->avgs_lock);
>         /* Init trigger-related members */
>
>
>
>
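
To make the effect of this initialization concrete, here is a small
userspace sketch with hypothetical numbers (assuming sched_clock() counts
nanoseconds; the ~97-day uptime value is made up):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t psi_period    = 2000000000ULL;       /* 2s in ns */
		uint64_t clock_at_init = 8380800000000000ULL; /* hypothetical uptime */
		uint64_t now = clock_at_init + psi_period;    /* first psi_avgs_work */

		uint64_t period_old = now - 0;             /* avg_last_update == 0 */
		uint64_t period_new = now - clock_at_init; /* with this patch */

		printf("old: period=%#llx low32=%#x\n",
		       (unsigned long long)period_old, (uint32_t)period_old);
		printf("new: period=%#llx low32=%#x\n",
		       (unsigned long long)period_new, (uint32_t)period_new);
		return 0;
	}

With the patch, the first period is roughly psi_period (~2s), which always
fits in 32 bits, so div_u64()'s u32 divisor cannot end up zero; without it,
the first period inherits the full clock value and its lower 32 bits can be
anything, including 0.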
