linux-kernel - sched: How does the scheduler determine which CPU core gets the job?

Message-ID: <57078F83.20405@ts.fujitsu.com>
Date:	Fri, 8 Apr 2016 13:01:23 +0200
From:	Rainer Koenig <Rainer.Koenig@...fujitsu.com>
To:	<linux-kernel@...r.kernel.org>
Subject: sched: How does the scheduler determine which CPU core gets the job?

Short summary:
==============
Investigating an issue where parallel tasks are spread differently over
the available CPU cores, depending on whether the machine was cold booted
from power off or warm booted by init 6. After a cold boot the parallel
processes are spread as expected, so that with "N" cores and "N" tasks
every core gets one task. The same test after a warm boot shows the tasks
spread differently, which results in lousy performance.

More details:
=============
Have a workstation here with 2 physical CPUs, Intel(R) Xeon(R) CPU
E5-2680 v3 @ 2.50GHz, which sums up to 48 cores (including hyperthreading).

The test sample is an example from the LIGGGHTS tutorial files.

The test is called like this:

mpirun -np 48 liggghts < in.chutewear

The performance and CPU load are monitored with htop.

If I run the test after a cold boot everything is as I expected it to
be. 48 parallel processes are started, distributed over 48 cores, and I
see that every CPU core is working at around 100% load.

Same hardware, same test, the only difference being that in the meantime
I did a reboot. The behaviour is totally different. This time only a few
CPU cores get the processes, so many cores are just idling.

Question that comes to my mind:
===============================
What can cause such behaviour? OK, the simple answer would be "talk to
your machine vendor and ask them what they have done wrong during
initialization when the system is rebooted". The bad news is that
I'm working for that vendor and we need an idea of what to look for. After
discussing this on the OpenMPI list I have now decided to ask here for help.

What we tried out so far:
=========================

- Compared dmesg output between cold and warm boot. Nothing special,
  just a few different numbers for computed performance and different
  timestamps.
- Compared the output of lstopo from hwloc, but nothing special here
  either.
- Wrote a script that makes a snapshot of all /proc/<pid>/status files
  for the liggghts jobs and compared the snapshots. Now it's clear that
  we still launch 48 processes, but they are distributed differently.
- Tried a newer kernel (the test is running on Ubuntu 14.04.4). Performance
  got a bit better, but the problem still exists.
- Took snapshots of /proc/sched_debug while the test is running after a
  cold or a warm boot. The problem is that to interpret this output I
  would need to know the details of how the scheduler works. But that's
  why I'm asking here.
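
A minimal sketch of the kind of /proc snapshot mentioned in the list
above. This is not the script actually used for the tests, only an
illustration under assumptions: all function names are hypothetical, the
"processor" value is taken from field 39 of /proc/<pid>/stat (the CPU the
task last ran on), and the affinity is read from the Cpus_allowed_list
line of /proc/<pid>/status.

```python
import os
from collections import Counter


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def last_cpu(stat_text):
    """Return the CPU a task last ran on (field 39 of /proc/<pid>/stat)."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so only the part after the last ')' is split on whitespace.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), therefore field 39 is rest[36].
    return int(rest[36])


def snapshot(name, proc="/proc"):
    """Map pid -> (last CPU, Cpus_allowed_list) for processes named `name`."""
    result = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                status = parse_status(f.read())
            if status.get("Name") != name:
                continue
            with open(os.path.join(proc, entry, "stat")) as f:
                cpu = last_cpu(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited meanwhile, or the file was unreadable
        result[int(entry)] = (cpu, status.get("Cpus_allowed_list", "?"))
    return result


def tasks_per_cpu(snap):
    """Count how many of the snapshotted processes last ran on each CPU."""
    return Counter(cpu for cpu, _ in snap.values())
```

Calling tasks_per_cpu(snapshot("liggghts")) while the test is running,
once after a cold boot and once after a warm boot, gives two per-CPU
counts that can be compared directly. A CPU with a count above 1 next to
CPUs that do not appear at all would match the uneven spread seen in htop.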

So, if anyone has an idea what to look for please post it here and add
me to Cc:

TIA
Rainer
-- 
Dipl.-Inf. (FH) Rainer Koenig
Project Manager Linux Clients
FJ EMEIA PR PSO PM&D CCD ENG SW OSS&C

Fujitsu Technology Solutions
Bürgermeister-Ullrich-Str. 100
86199 Augsburg
Germany

Telephone: +49-821-804-3321
Telefax:   +49-821-804-2131
Mail:      mailto:Rainer.Koenig@...fujitsu.com

Internet         ts.fujtsu.com
Company Details  ts.fujitsu.com/imprint.html