linux-kernel - Re: 5.6-rc3: WARNING: CPU: 48 PID: 17435 at kernel/sched/fair.c:380 enqueue_task


Message-Id: <2108173c-beaa-6b84-1bc3-8f575fb95954@de.ibm.com>
Date:   Wed, 4 Mar 2020 18:42:27 +0100
From:   Christian Borntraeger <borntraeger@...ibm.com>
To:     Vincent Guittot <vincent.guittot@...aro.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: 5.6-rc3: WARNING: CPU: 48 PID: 17435 at kernel/sched/fair.c:380
 enqueue_task_fair+0x328/0x440



On 04.03.20 16:26, Vincent Guittot wrote:
> On Tue, 3 Mar 2020 at 08:55, Vincent Guittot <vincent.guittot@...aro.org> wrote:
>>
>> On Tue, 3 Mar 2020 at 08:37, Christian Borntraeger
>> <borntraeger@...ibm.com> wrote:
>>>
>>>
>>>
> [...]
>>>>>> ---
>>>>>>  kernel/sched/fair.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 3c8a379c357e..beb773c23e7d 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -4035,8 +4035,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>>>>             __enqueue_entity(cfs_rq, se);
>>>>>>     se->on_rq = 1;
>>>>>>
>>>>>> +   list_add_leaf_cfs_rq(cfs_rq);
>>>>>>     if (cfs_rq->nr_running == 1) {
>>>>>> -           list_add_leaf_cfs_rq(cfs_rq);
>>>>>>             check_enqueue_throttle(cfs_rq);
>>>>>>     }
>>>>>>  }
>>>>>
>>>>> Now running for 3 hours. I have not seen the issue yet. I can tell tomorrow if this fixes
>>>>> the issue.
>>>>
>>>>
>>>> Still running fine. I can tell for sure tomorrow, but I have the impression that this makes the
>>>> WARN_ON go away.
>>>
>>> So I guess this change "fixed" the issue. If you want me to test additional patches, let me know.
>>
>> Thanks for the test. For now, I don't have any other patch to test. I
>> have to look more deeply how the situation happens.
>> I will let you know if I have other patch to test
> 
> So I haven't been able to figure out how we reach this situation yet.
> In the meantime I'm going to make a clean patch with the fix above.
> 
> Is it ok if I add a reported -by and a tested-by you ?

Sure.
I just realized that this system has something special. Some months ago I created 2 slices:
$ head /etc/systemd/system/*.slice
==> /etc/systemd/system/machine-production.slice <==
[Unit]
Description=VM production
Before=slices.target
Wants=machine.slice
[Slice]
CPUQuota=2000%
CPUWeight=1000

==> /etc/systemd/system/machine-test.slice <==
[Unit]
Description=VM production
Before=slices.target
Wants=machine.slice
[Slice]
CPUQuota=300%
CPUWeight=100


And the guests are then put into these slices. That also means that this test will never use more than
2300% CPU, no matter how many CPUs the system has.
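
For reference, a rough sketch of how the 2300% figure falls out of the two slice files above. systemd implements CPUQuota= through CFS bandwidth control, where 100% corresponds to one full CPU per period. The 100 ms period is an assumption here (the default when CPUQuotaPeriodSec= is not set), and the helper names are made up for illustration:

```python
# Sketch: convert systemd CPUQuota= percentages into CFS bandwidth terms.
# Assumption: default 100 ms period (CPUQuotaPeriodSec= is not set in the slices).

PERIOD_US = 100_000  # assumed CFS period, in microseconds

# CPUQuota= values taken from the two slice files
slices = {
    "machine-production.slice": 2000,
    "machine-test.slice": 300,
}


def quota_us(percent, period_us=PERIOD_US):
    """Runtime in microseconds a slice may consume per period (100% == one CPU)."""
    return percent * period_us // 100


def total_cpus(quotas):
    """Upper bound, in CPUs' worth of runtime, across all slices together."""
    return sum(quotas.values()) / 100


for name, pct in slices.items():
    print(f"{name}: {pct}% -> {quota_us(pct)} us per {PERIOD_US} us period")

print(f"combined cap: {sum(slices.values())}% = {total_cpus(slices)} CPUs")
```

So with both slices fully loaded, the guests are capped at roughly 23 CPUs' worth of runtime, and the groups get throttled once they hit their quota, regardless of how many CPUs are online.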