linux-kernel - Re: [External] Re: Fwd: WARNING: CPU: 13 PID: 3837105 at kernel/sched/sched.h:1561 __cfsb_csd


Message-ID: <dd8e6bfb-9d84-6a5d-94cb-4833f5d1943b@bytedance.com>
Date:   Fri, 8 Sep 2023 11:28:34 +0800
From:   Hao Jia <jiahao.os@...edance.com>
To:     Tim Chen <tim.c.chen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Benjamin Segall <bsegall@...gle.com>,
        Bagas Sanjaya <bagasdotme@...il.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Igor Raits <igor.raits@...il.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux Regressions <regressions@...ts.linux.dev>,
        Linux Stable <stable@...r.kernel.org>
Subject: Re: [External] Re: Fwd: WARNING: CPU: 13 PID: 3837105 at
 kernel/sched/sched.h:1561 __cfsb_csd_unthrottle+0x149/0x160



On 2023/9/8 Tim Chen wrote:
> On Thu, 2023-09-07 at 16:59 +0800, Hao Jia wrote:
>>
>> On 2023/9/5 Peter Zijlstra wrote:
>>> On Thu, Aug 31, 2023 at 04:48:29PM +0800, Hao Jia wrote:
>>>
>>>> If I understand correctly, rq->clock_update_flags may be set to
>>>> RQCF_ACT_SKIP after __schedule() holds the rq lock, and sometimes the rq
>>>> lock may be released briefly in __schedule(), such as in newidle_balance().
>>>> At that point other CPUs can take this rq lock, and then calling
>>>> rq_clock_start_loop_update() may trigger this warning.
>>>>
>>>> This warning check might be wrong. We need to add assert_clock_updated() to
>>>> check that the rq clock has been updated before calling
>>>> rq_clock_start_loop_update().
>>>>
>>>> Maybe some things can be like this?
>>>
>>> Urgh, aside from it being white space mangled, I think this is entirely
>>> going in the wrong direction.
>>>
>>> Leaking ACT_SKIP is dodgy as heck.. it's entirely too late to think
>>> clearly though, I'll have to try again tomorrow.
> 
> I am trying to understand why this is an ACT_SKIP leak.
> Before call to __cfsb_csd_unthrottle(), is it possible someone
> else lock the runqueue, set ACT_SKIP and release rq_lock?
> And then that someone never update the rq_clock?
> 

Yes, we want to set rq->clock_update_flags to RQCF_ACT_SKIP to avoid
updating the rq clock multiple times in __cfsb_csd_unthrottle().

But now that we have found an ACT_SKIP leak, we cannot unconditionally
set rq->clock_update_flags to RQCF_ACT_SKIP in rq_clock_start_loop_update().
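
To make the leak scenario concrete, here is a minimal userspace sketch in C.
It is only a model under stated assumptions, not kernel code: the three flag
values match kernel/sched/sched.h, but struct model_rq and every model_*
function are hypothetical names introduced for illustration, and locking is
not modelled at all.

```c
#include <stdbool.h>

/* Flag values as in kernel/sched/sched.h; everything below is a model. */
#define RQCF_REQ_SKIP 0x01
#define RQCF_ACT_SKIP 0x02
#define RQCF_UPDATED  0x04

/* Hypothetical stand-in for the two struct rq fields we care about. */
struct model_rq {
	unsigned int clock_update_flags;
	unsigned int clock_updates;	/* number of real clock updates */
};

/* Models update_rq_clock(): does nothing while ACT_SKIP is set. */
static void model_update_rq_clock(struct model_rq *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;
	rq->clock_update_flags |= RQCF_UPDATED;
	rq->clock_updates++;
}

/*
 * Models an unconditional rq_clock_start_loop_update().
 * Returns true when ACT_SKIP was already set, i.e. the case
 * in which the warning would fire.
 */
static bool model_start_loop_update(struct model_rq *rq)
{
	bool already_set = (rq->clock_update_flags & RQCF_ACT_SKIP) != 0;

	rq->clock_update_flags |= RQCF_ACT_SKIP;
	return already_set;
}

/* Models rq_clock_stop_loop_update(): clock updates are allowed again. */
static void model_stop_loop_update(struct model_rq *rq)
{
	rq->clock_update_flags &= ~RQCF_ACT_SKIP;
}

/* Replays the leak: returns true if the warning condition is hit. */
static bool model_leak_scenario(void)
{
	struct model_rq rq = { 0, 0 };

	/* CPU A, in __schedule(): ACT_SKIP is set, then the rq lock is
	 * dropped briefly (for example in newidle_balance()). */
	rq.clock_update_flags |= RQCF_ACT_SKIP;

	/* CPU B takes the lock and runs the unthrottle path. */
	model_update_rq_clock(&rq);		/* silently skipped */
	return model_start_loop_update(&rq);	/* sees the leaked flag */
}
```

In this model, model_leak_scenario() reports the warning condition and the
clock is never updated on the second CPU, which is the "missing update" case
Tim asks about.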


>>
>> Hi Peter,
>>
>> Do you think this fix method is correct? Or should we go back to the
>> beginning and move update_rq_clock() from unthrottle_cfs_rq()?
>>
> If anyone who locked the runqueue set ACT_SKIP also will update rq_clock,
> I think your change is okay.  Otherwise rq_clock could be missing update.
> 
> Thanks.
> 
> Tim