Date:	Mon, 13 Oct 2008 17:27:26 +0200 (CEST)
From:	Oliver Weihe <o.weihe@...tacomputer.de>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: NUMA allocator on Opteron systems does non-local allocation on node0

Hello,

It seems that my reproducer is not very good. :(
It "works" much better when you start several processes at once.

for i in `seq 0 3`
do
  numactl --cpunodebind=${i} ./app &
done
wait

"app" still allocates some memory (7GiB per process) and fills the array
with data.
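
Since the source of "app" isn't attached, here is a minimal stand-in
for it (a sketch only, assuming plain malloc() plus a fill loop; the
real program differs in details):

/* alloc_fill.c -- allocate ~7 GiB and touch every page with data.
 * Hypothetical stand-in for "app"; build with:
 *   gcc -O2 -o app alloc_fill.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRAY_BYTES (7UL * 1024 * 1024 * 1024)  /* ~7 GiB */

int main(void)
{
        /* malloc() only reserves address space; the memset() below
         * faults every page in, which is where the kernel makes the
         * NUMA placement decision. */
        char *buf = malloc(ARRAY_BYTES);
        if (!buf) {
                fprintf(stderr, "malloc failed\n");
                return 1;
        }
        memset(buf, 0x55, ARRAY_BYTES);  /* write to every page */
        printf("filled %lu bytes\n", ARRAY_BYTES);
        free(buf);
        return 0;
}

While it runs, the per-node placement can be watched with numastat.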


I've noticed this behaviour during some HPL (Linpack benchmark from/for
top500.org) runs. For small data sets there's no difference in speed
between the kernels, while for big data sets (almost the whole memory)
2.6.23 and newer kernels are slower than 2.6.22.
I'm using OpenMPI with the runtime option "--mca mpi_paffinity_alone 1"
to pin each process to a specific CPU.
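
An illustrative invocation (the exact process count depends on the
machine, and "xhpl" is the usual name of the HPL binary):

mpirun -np 16 --mca mpi_paffinity_alone 1 ./xhpl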

The bad news is: I can crash almost every quad-core Opteron system
running kernels 2.6.21.x to 2.6.24.x with "parallel memory allocation
and filling the memory with data" (parallel means: there is one process
per core doing this). While it takes some time on dual-socket machines,
on quad-socket quad-cores it often takes less than a minute until the
system freezes.
Just in case this is some vendor-specific BIOS bug: we're using
Supermicro mainboards.
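
(The crash reproducer is just the loop from above widened from one
process per node to one per core, e.g. `seq 0 15` with
numactl --physcpubind=${i} ./app on a 16-core quad-socket box.)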

> [Another copy of the reply with linux-kernel added this time]
> 
> > In my setup I'm allocating an array of ~7GiB memory size in a
> > single-threaded application.
> > Startup: numactl --cpunodebind=X ./app
> > For X=1,2,3 it works as expected, all memory is allocated on the
> > local node.
> > For X=0 I can see the memory being allocated on node0 as long as
> > ~3GiB are "free" on node0. At this point the kernel starts using
> > memory from node1 for the app!
> 
> Hmm, that sounds like it doesn't want to use the 4GB DMA zone.
> 
> Normally there should be no protection on it, but perhaps something 
> broke.
> 
> What does cat /proc/sys/vm/lowmem_reserve_ratio say?

2.6.22.x:
# cat /proc/sys/vm/lowmem_reserve_ratio
256     256

2.6.23.8 (and above):
# cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32
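
The extra third field presumably corresponds to the additional memory
zone (ZONE_MOVABLE) that was introduced between 2.6.22 and 2.6.23. The
per-zone protections the kernel derives from these ratios show up in
the "protection:" lines of /proc/zoneinfo:

grep -E 'zone|protection' /proc/zoneinfo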


> > For parallel real-world apps I've seen a performance penalty of 30%
> > compared to older kernels!
> 
> Compared to what older kernels? When did it start?

I've tested some kernel versions that I have lying around here...
Working fine: 2.6.22.18-0.2-default (openSUSE) / 2.6.22.9 (kernel.org)
Showing the described behaviour: 2.6.23.8; 2.6.24.4; 2.6.25.4;
2.6.26.5; 2.6.27


> 
> -Andi
> 
> -- 
> ak@...ux.intel.com
> 


-- 

Regards,
Oliver Weihe
