Message-ID: <50253205-08d9-5ff4-98a9-3aa3bc669a75@arm.com>
Date:   Tue, 11 Jan 2022 13:37:30 +0100
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, linux-kernel@...r.kernel.org,
        rickyiu@...gle.com, odin@...d.al, sachinp@...ux.vnet.ibm.com,
        naresh.kamboju@...aro.org
Subject: Re: [PATCH v2 1/3] sched/pelt: Don't sync hardly util_sum with uti_avg

On 11/01/2022 08:54, Vincent Guittot wrote:
> On Fri, 7 Jan 2022 at 16:21, Vincent Guittot <vincent.guittot@...aro.org> wrote:
>>
>> On Fri, 7 Jan 2022 at 12:43, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>>
>>> On 05/01/2022 14:57, Vincent Guittot wrote:
>>>> On Wed, 5 Jan 2022 at 14:15, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>>>>
>>>>> On 04/01/2022 14:42, Vincent Guittot wrote:
>>>>>> On Tue, 4 Jan 2022 at 12:47, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>>>>>>>
>>>>>>> On 22/12/2021 10:38, Vincent Guittot wrote:

[...]

>>>> Some of the changes done in PELT signal propagation, which replace
>>>> subtracting util_sum with using util_avg * divider instead, are
>>>> related to other problems with the sched group hierarchy and
>>>> throttling/unthrottling. I'm not 100% confident that backporting
>>>> this to stable with a Fixes tag wouldn't also require backporting
>>>> more patches in other areas to avoid resurrecting old problems. So
>>>> I wonder if it's worth backporting this to stable at all.
>>>
>>> OK, I see. So only 1c35b07e6d39 (i.e. the util_sum/util_avg change in
>>> update_cfs_rq_load_avg() (1)) caused the CPU frequency regression. That
>>> was the reason why I initially suggested splitting the patch-set
>>> differently. But you said that you saw the issue even when (1) is fixed.
>>
>> Ok, I think that we were not speaking about the same setup. I wrongly
>> read that you were saying that
>> sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
>> was only needed in update_cfs_rq_load_avg() but not in the other places.
>>
>> But what you said is that we only need the below to fix the perf
>> regression raised by Rick?
>>      r = removed_util;
>>      sub_positive(&sa->util_avg, r);
>>  -   sa->util_sum = sa->util_avg * divider;
>>  +   sub_positive(&sa->util_sum, r * divider);
>>  +   sa->util_sum = max_t(u32, sa->util_sum, sa->util_avg * MIN_DIVIDER);
> 
> The test with the code above didn't trigger any SCHED_WARN over the
> weekend, so it's probably ok to make a dedicated patch for this with a
> Fixes tag.
> I'm going to prepare a v2

Yes, `sa->X_sum = max_t(u32, sa->X_sum, sa->X_avg * MIN_DIVIDER)` is
needed for all 3 X = [load, runnable, util] in update_cfs_rq_load_avg()
to not hit the SCHED_WARN_ON() in cfs_rq_is_decayed().

>> The WARN that I mentioned in my previous email was about not adding
>> the max_t in all 3 places. I reran some tests today and I triggered the
>> WARN after a detach without the max_t line.
>>
>> I can probably isolate the code above in a dedicated patch for the
>> regression raised by Rick and we could consider adding a Fixes tag; I
>> will run more tests with only this part during the weekend.
>> That being said, we need to stay consistent in all 3 places where we
>> move or propagate some *_avg. In particular, using "sa->util_sum =
>> sa->util_avg * divider" has the drawback of filtering out the
>> contribution not already accounted for in util_avg, and the impact is
>> much larger than expected. It means that although fixing only
>> update_cfs_rq_load_avg() seems enough for Rick's case, some other
>> cases could be impacted by the 2 other places, and we need to fix
>> them now that we have a better view of the root cause.

Agreed.
