Message-ID: <4A27C7F8.3010207@itwm.fraunhofer.de>
Date: Thu, 04 Jun 2009 15:11:20 +0200
From: Martin Vogt <vogt@...m.fraunhofer.de>
To: linux-kernel@...r.kernel.org
Subject: NUMA regression(?) on 32core shanghai
Hello,
I am seeing strange/unexpected benchmark results on my NUMA machine,
a 32-core Shanghai system with 512GB RAM.
My benchmark shows runtimes that vary by up to a factor of 12(!) for
identical tests, and I think this is a bug somewhere.
I have tested the following kernels:
- 2.6.30-rc8, 2.6.29.4 and the SLES10-SP1 kernel
All show the same problem for 16/32 threads in the first run
(but not always!).
For example, with 2.6.30-rc8:
16-1: 33.403038s 28.906326s <<-- strange values
16-2: 5.444921s 5.072422s
16-3: 6.266797s 6.152743s
This is why I think this is a bug:
----------------------------------
My understanding of the NUMA memory bandwidth test is:
- if I attach 8 threads, one to each NUMA node,
- and allocate 512MB of local memory for each thread,
THEN:
- the runtime should be nearly constant over all nodes and all runs
  (for example: every thread runs 3 seconds).
If I now double the threads (16 threads, 2 on each NUMA node),
then:
- the runtime should double too, because two threads now share one
  node's local memory bandwidth
  (for example: 6 seconds instead of three),
and so on; for 32 threads, 12 seconds, etc.
(A rough sketch of the setup I mean is below.)
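Roughly, each thread does something like the following (a simplified
sketch, not the actual attached mbind2.cpp; it assumes libnuma plus
std::thread, and names like kBufSize/worker are only for illustration):

#include <numa.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

static const size_t kBufSize = 512UL << 20;  // 512MB per thread

static void worker(int node, double *read_s, double *write_s)
{
    numa_run_on_node(node);  // pin this thread to the CPUs of "node"
    volatile char *buf = (volatile char *)numa_alloc_onnode(kBufSize, node);
    memset((char *)buf, 1, kBufSize);  // fault all pages in on the local node

    auto t0 = std::chrono::steady_clock::now();
    unsigned long sum = 0;
    for (size_t i = 0; i < kBufSize; i += 64)  // read pass, one cache line apart
        sum += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < kBufSize; i += 64)  // write pass
        buf[i] = (char)sum;
    auto t2 = std::chrono::steady_clock::now();

    *read_s  = std::chrono::duration<double>(t1 - t0).count();
    *write_s = std::chrono::duration<double>(t2 - t1).count();
    numa_free((void *)buf, kBufSize);
}

int main(int argc, char **argv)
{
    if (numa_available() < 0) return 1;
    int nthreads = argc > 1 ? atoi(argv[1]) : 8;
    int nodes = numa_max_node() + 1;      // 8 nodes on this machine
    std::vector<std::thread> threads;
    std::vector<double> r(nthreads), w(nthreads);
    for (int i = 0; i < nthreads; i++)    // 8 -> 1/node, 16 -> 2/node, ...
        threads.emplace_back(worker, i % nodes, &r[i], &w[i]);
    for (auto &t : threads) t.join();
    // report the slowest thread, as in the tables below
    printf("%02d: %f %f\n", nthreads,
           *std::max_element(r.begin(), r.end()),
           *std::max_element(w.begin(), w.end()));
    return 0;
}

(build with: g++ -O2 -pthread sketch.cpp -lnuma)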
The machine sometimes behaves as expected, but in the 16/32-thread case
it usually shows these strange runtimes in the first run.
(But this can happen for the 8-thread test too.)
What is wrong here? The first run is a factor of ~12 slower on the old
kernel and a factor of ~4 on the newer ones; there must be something
wrong with this.
How can I debug it?
regards,
Martin
PS: on a smaller Opteron NUMA system (4 nodes with 2 cores each and
8GB per node) the test program works as expected.
PPS: the "bug" does not happens always, but very often with 16/32 threads
and: the behaviour is the same if I replace numa_alloc_onnode with malloc
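For clarity, that swap is only in the allocation call; since each thread
is already pinned to its node, the kernel's first-touch policy should
place the malloc'd pages locally too (alloc_buf is just an illustrative
name, not from mbind2.cpp):

#include <numa.h>
#include <cstdlib>
#include <cstring>

// Illustrative helper: both variants should end up with node-local pages,
// because the memset (first touch) runs on the thread already pinned to "node".
static char *alloc_buf(size_t size, int node, bool use_malloc)
{
    char *buf = use_malloc
        ? (char *)malloc(size)                    // placement by first-touch
        : (char *)numa_alloc_onnode(size, node);  // explicit node binding
    memset(buf, 1, size);
    return buf;
}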
Benchmark:
- cron is off/HZ is 100/libc 2.4-31.43.7 from SLES10
- Format example:
08-1: 3.405676 3.023264
means: 8 threads, first run; the read pass took 3.4 seconds and the
write pass 3.0 seconds.
2.6.30-rc8
=====================
04-1: 3.591044 3.295444
04-2: 3.588437 3.280143
04-3: 3.448116 2.995627
08-1: 4.122432 3.566830
08-2: 4.119241 3.548015
08-3: 3.819517 3.349197
16-1: 33.403038 28.906326 <<-- strange values
16-2: 5.444921 5.072422
16-3: 6.266797 6.152743
32-1: 49.885150 76.500259 <<-- strange values
32-2: 19.114738 12.170802
32-3: 14.807441 11.064564
2.6.29.4
==================
04-1: 3.375012 3.057332
04-2: 3.401835 3.039497
04-3: 3.359395 2.980974
08-1: 3.405676 3.023264
08-2: 3.257743 3.000751
08-3: 3.129684 2.886261
16-1: 22.417126 11.807065 <<-- strange values
16-2: 6.031583 5.098305
16-3: 5.088144 5.457238
32-1: 45.829553 24.225427 <<-- strange values
32-2: 13.165044 12.290732
32-3: 8.908012 11.622502
2.6.16 (SuSE SLES10-SP1)+perfctr
================================
(Seconds: the value reported is that of the slowest thread)
#Threads-run  read in secs  write in secs
04-1: 3.375012 3.057332
04-2: 3.401835 3.039497
04-3: 3.359395 2.980974
08-1: 3.405676 3.023264
08-2: 3.257743 3.000751
08-3: 3.129684 2.886261
16-1: 74.399871 12.747340 <<-- strange values
16-2: 7.449596 4.401576
16-3: 6.123250 5.518968
32-1: 150.927981 55.032012 <<-- strange values
32-2: 12.119996 12.203303
32-3: 11.601377 12.485716
[Attachment: mbind2.cpp.gz (application/x-gzip, 1779 bytes)]