

Message-ID: <CH2PR19MB3896AFE1D13AD88A17160860FC700@CH2PR19MB3896.namprd19.prod.outlook.com>
Date:   Fri, 15 Nov 2019 00:43:42 +0000
From:   "Rafikov, Rustem" <Rustem.Rafikov@...l.com>
To:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RT scheduler is suboptimal when an RT thread preempts another RT in
 terms of choosing a core to migrate

Hi,

When an RT thread preempts another RT thread, the scheduler migrates the preempted thread to another core.
The way the RT scheduler chooses that core is quite suboptimal. Let me give an example from a "production" server with 32 physical cores in total.
There are SCHED_NORMAL threads (each affined to a particular core) and 2+ groups of RT threads (allowed to run everywhere).
A scheduler trace showed that in most cases the RT scheduler preempts a normal-priority thread to make room for the evicted RT thread rather than using an idle core, even though, according to the trace, the system had plenty of idle cores.

I reproduced the behavior on a vanilla 4.18.0 kernel with a micro test in which I created 10 SCHED_NORMAL threads affined to 10 cores,
3 RT/69 threads with 0xFFFFFFFF affinity, and a few RT/79 threads kicking other RT threads off their CPUs every 5 msec.
The other cores were idle, but the RT/69 threads never migrated to them.

The problem seems to be in how the mapping in the cpupri structure is updated:
1) The fair scheduler neither updates it nor reads from it, so we don't know whether a SCHED_NORMAL thread has left a CPU. Well, that may be OK.
2) The RT scheduler uses cpupri to find a core to migrate to, but it updates it incorrectly:
- RT->RT works fine [2]
- But RT->IDLE or RT->SCHED_NORMAL [1] is not right: in both cases it sets RT_MAX (100), which is the minimum NORMAL priority!
It's totally okay to set it to RT_MAX for all NORMAL threads, but not for IDLE. BTW, IDLE means swapper, which has pri=120 :)
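
To illustrate why a value of 100 can never mark a core as idle, here is a small user-space model of the priority conversion that cpupri_set() applies. It is only a sketch based on my reading of convert_prio() in kernel/sched/cpupri.c for kernels of this era; the function name convert_prio_model and the constant values are assumptions, not copied from the tree being discussed.

```c
/*
 * User-space model of the kernel-prio -> cpupri conversion.
 * Assumed constants (believed to match 4.18, not taken from the email):
 *   MAX_RT_PRIO = 100, MAX_PRIO = 140,
 *   cpupri scale: -1 invalid, 0 idle, 1 normal, 2..101 RT.
 */
#define MAX_RT_PRIO     100
#define MAX_PRIO        140
#define CPUPRI_INVALID  (-1)
#define CPUPRI_IDLE     0
#define CPUPRI_NORMAL   1

/* Hypothetical helper name; models what the kernel does with "newp". */
static int convert_prio_model(int prio)
{
    if (prio == CPUPRI_INVALID)
        return CPUPRI_INVALID;
    if (prio == MAX_PRIO)           /* only 140 maps to idle */
        return CPUPRI_IDLE;
    if (prio >= MAX_RT_PRIO)        /* 100..139 all collapse to normal */
        return CPUPRI_NORMAL;
    return MAX_RT_PRIO - prio + 1;  /* RT: kernel prio 0..99 -> 101..2 */
}
```

Under these assumptions, newp=0x64 (100) converts to 1 (NORMAL) regardless of whether the core goes idle or runs a SCHED_NORMAL thread afterwards; only 140 would yield 0 (IDLE).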

See the kprobe traces below.

[1] IDLE->RT/79->IDLE
#1. <idle>-0     [001] d.h. 14717592.107294: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0001
#2. <...>-157332 [001] d... 14717592.107313: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=64 oldpri=0051

Decoding the output at #1: cpu=1 newp=14 oldpri=0001
- cpu=1 - it happens on core 1
- newp=14 - the priority of the thread being scheduled in is 0x14, which is RT-79 (our test thread)
- oldpri=0001 - the priority of the previous thread on that CPU. "1" means NORMAL on the 0-101 scale. This is incorrect in itself because the core was IDLE!
Let's try to figure out why it is not '0' (IDLE) by looking at the last line (#2): cpu=1 newp=64 oldpri=0051
- newp=64 says that the priority of the thread being scheduled in is 0x64, which is the minimum NORMAL priority. So it is not 140, as we would expect when switching to the IDLE thread.
- oldpri=0051 - this is 81, the priority of our RT-79 thread on the 0-101 scale


[2] RT/69->RT/79->RT/69
#1. <...>-158253 [001] d.h. 14723119.396120: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0047
#2. <...>-158254 [001] d... 14723119.396122: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=1e oldpri=0051

Line #1 - "cpu=1 newp=14 oldpri=0047" - switching to 0x14, the RT-79 thread
- oldpri=0047 - the priority currently on the CPU is 0x47 on the 0-101 scale, i.e. RT-69
Line #2 - "cpu=1 newp=1e oldpri=0051" - switching to 0x1e, RT-69. This is the correct value for the thread being scheduled in!
- oldpri=0051 - 0x51 (81) on the 0-101 scale, i.e. the RT-79 thread being switched out
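
For anyone repeating the decoding by hand, the following sketch parses the three fields from a trace line in the format shown above and converts them back to RT priorities. The struct and function names are hypothetical, and the two conversions rest on assumptions consistent with the values decoded in this mail: newp is a kernel priority (RT priority = 99 - newp) and oldpri is on the 0-101 cpupri scale (RT priority = oldpri - 2).

```c
#include <stdio.h>
#include <string.h>

/* Fields of one cpupri_set probe hit; names are illustrative only. */
struct cpupri_event {
    int cpu;
    unsigned int newp;    /* kernel priority, printed in hex */
    unsigned int oldpri;  /* cpupri-scale priority, printed in hex */
};

/* Returns 0 on success, -1 if the three fields are not found. */
static int parse_cpupri_event(const char *line, struct cpupri_event *ev)
{
    const char *p = strstr(line, "cpu=");

    if (p == NULL)
        return -1;
    if (sscanf(p, "cpu=%d newp=%x oldpri=%x",
               &ev->cpu, &ev->newp, &ev->oldpri) != 3)
        return -1;
    return 0;
}

/* RT priority for a kernel priority, or -1 if it is not an RT value. */
static int rt_from_newp(unsigned int newp)
{
    return newp < 100 ? 99 - (int)newp : -1;
}

/* RT priority for a cpupri value, or -1 for idle (0) / normal (1). */
static int rt_from_cpupri(unsigned int oldpri)
{
    return (oldpri >= 2 && oldpri <= 101) ? (int)oldpri - 2 : -1;
}
```

Feeding it line #2 of trace [1] gives newp=100 (not an RT value) and oldpri=81, which maps back to RT-79.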

Thanks,
Rustem