Date:	Fri, 14 Oct 2011 14:36:30 +0100
From:	"Seger, Mark" <mark.seger@...com>
To:	Eric Dumazet <eric.dumazet@...il.com>,
	"Oberman, Laurence (HAS GSE)" <Laurence.Oberman@...com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"Cabaniols, Sebastien" <Sebastien.Cabaniols@...com>
Subject: RE: Regression in reading /proc/stat in the newer kernels with
 large SMP and NUMA configurations

> > The HP-developed collectl utility depends heavily on abstractions in
> > /proc and /sys provided by the kernel for kernel statistics capture.
> > During testing on newer NUMA platforms with large numbers of CPUs, HP
> > has observed a significant increase in the time taken to parse this
> > data with the newer 2.6.32.x and higher kernels.

While this does indeed affect collectl and renders it ineffective at monitoring intervals of 0.1 seconds or lower, it affects ANY utility that reads /proc/stat, and the more frequent the access, the heavier the load.  Remember, the increase in load is by a factor of over 50, so those who like to run top or vmstat or even iostat will see a much heavier hand as well.  If people ran sar at a shorter monitoring interval it too would suffer.

I've also observed that reading /proc/meminfo consumes a lot more CPU than it used to, though not as badly as /proc/stat.

> Strange, here is what I have on my 2x4x2 machine.

Forgive a dumb question, but what does 2x4x2 mean?  ;(

I've observed that things are fine on 2-socket systems and truly horrible on the few 8-socket machines I tested this on, which also had higher core counts, so there were always on the order of 48+ CPUs involved.  I haven't done any testing on 4-socket machines, and I don't even want to think about what could happen on 16- or 32-socket machines, or however many sockets future machines are built with.

> Could you try following actions please ?
> 
> # perf record ./prog
> Opened, read and closed 8640 times and read total of 29595948 bytes
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.057 MB perf.data (~2482 samples) ]
> 
> # perf report --stdio
> # Events: 1K cycles
> #
> # Overhead  Command      Shared Object                        Symbol
> # ........  .......  .................  ............................
> #
>     22.82%     prog  [kernel.kallsyms]  [k] format_decode
>     21.15%     prog  [kernel.kallsyms]  [k] vsnprintf
>     13.46%     prog  [kernel.kallsyms]  [k] number.isra.4
>     11.33%     prog  [kernel.kallsyms]  [k] memcpy
>      8.87%     prog  [kernel.kallsyms]  [k] kstat_irqs
>      3.80%     prog  [kernel.kallsyms]  [k] radix_tree_lookup_element
>      3.69%     prog  [kernel.kallsyms]  [k] seq_printf
>      3.63%     prog  [kernel.kallsyms]  [k] find_next_bit
>      2.81%     prog  [kernel.kallsyms]  [k] show_stat
>      1.18%     prog  [kernel.kallsyms]  [k] put_dec_trunc
>      0.96%     prog  [kernel.kallsyms]  [k] radix_tree_lookup
>      0.96%     prog  [kernel.kallsyms]  [k] irq_to_desc
>      0.66%     prog  [kernel.kallsyms]  [k] getname_flags
>      0.37%     prog  [kernel.kallsyms]  [k] selinux_inode_alloc_security
>      0.22%     prog  [kernel.kallsyms]  [k] jiffies_64_to_clock_t
>      0.22%     prog  [kernel.kallsyms]  [k] copy_user_generic_string
>      0.22%     prog  [kernel.kallsyms]  [k] __call_rcu
>      0.15%     prog  [kernel.kallsyms]  [k] add_preempt_count
>      0.15%     prog  [kernel.kallsyms]  [k] put_dec_full
>      0.15%     prog  [kernel.kallsyms]  [k] nr_context_switches
>      0.15%     prog  [kernel.kallsyms]  [k] kfree

For me the easiest reproducer, which admittedly doesn't show any deeper analysis, is to just run "cat /proc/stat > /dev/null" in a loop and time it.
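
For anyone who wants something closer to the perf test above, here is a minimal sketch of such a reader (an illustration only, not Eric's actual ./prog; the program name, buffer size and argument handling are my own assumptions).  It just opens, reads and closes /proc/stat a given number of times and reports the elapsed time, so it can also be run under perf record:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(int argc, char **argv)
{
	/* Iteration count comes from argv[1]; defaults to 1000 passes. */
	int iterations = (argc > 1) ? atoi(argv[1]) : 1000;
	char buf[65536];
	long total = 0;
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iterations; i++) {
		/* Open, drain and close /proc/stat, like cat > /dev/null. */
		int fd = open("/proc/stat", O_RDONLY);
		ssize_t n;

		if (fd < 0) {
			perror("open /proc/stat");
			return 1;
		}
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			total += n;
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("Opened, read and closed %d times and read total of %ld bytes in %.3f seconds\n",
	       iterations, total,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}

Something like "gcc -O2 -o readstat readstat.c" (plus -lrt on older glibc for clock_gettime) and "./readstat 8640" should roughly mirror the run quoted above.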

-mark
