Message-ID: <4896FFFE.7080400@sgi.com>
Date:	Mon, 04 Aug 2008 06:11:26 -0700
From:	Stephen Champion <schamp@....com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>
CC:	Robin Holt <holt@....com>, linux-kernel@...r.kernel.org,
	Pavel Emelyanov <xemul@...nvz.org>,
	Oleg Nesterov <oleg@...sign.ru>,
	Sukadev Bhattiprolu <sukadev@...ibm.com>,
	Paul Menage <menage@...gle.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [Patch] Scale pidhash_shift/pidhash_size up based on num_possible_cpus().

Eric W. Biederman wrote:
> Robin Holt <holt@....com> writes:
>> Oops, confusing details.  That was a different problem we had been
>> tracking.
> 
> Which leads back to the original question.  What were you measuring
> that showed improvement with a larger pid hash size?
> 
> Almost by definition a larger hash table will perform better.  However
> my intuition is that we are talking about something that should be in
> the noise for most workloads.

Robin asked me to chime in on this, as I did the early "look at that" 
work and suggested it to Robin.

I noticed the potential for increasing pidhash_shift while chasing down
a patch to our kernel (based on 2.6.16 stable) which had
proc_pid_readdir() calling find_pid() for every pid from init_task
through the highest pid #.  This patch caused a rather serious problem
on a 2048-core Altix.  Before identifying the culprit, I increased
pidhash_shift.  This made a *huge* difference: enough to get the box
marginally functional while I tracked down the origins of the problem.

After backing out the problematic patch, I took a look at pidhash_shift
in normal circumstances.  With pidhash_shift == 12, running only a few
common services and monitoring tools (sendmail, nagios, etc., for ~28k
active processes, mostly of the kernel variety), the 20-cpu boot cpuset
we use on that system to confine normal system processes and interactive
logins was spending >1% of its time in find_pid(), and an 'ls /proc >
/dev/null' took >0.4s.  With pidhash_shift == 16, the timing went to
<0.2s, and the total time spent in find_pid() was reduced to noise level.
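
As a rough illustration of why the larger table helps, here is a
back-of-the-envelope sketch of the average hash chain length for the two
settings above.  It assumes the hash spreads pids uniformly across
buckets, which I have not verified; the function name is made up for
this example and is not kernel code.

```python
def avg_chain_length(nr_pids: int, pidhash_shift: int) -> float:
    """Expected entries per bucket, ASSUMING a uniform hash distribution."""
    nr_buckets = 1 << pidhash_shift  # the table has 2**pidhash_shift buckets
    return nr_pids / nr_buckets


# ~28k processes, as in the measurements above
for shift in (12, 16):
    print(f"pidhash_shift={shift}: {1 << shift} buckets, "
          f"~{avg_chain_length(28_000, shift):.2f} pids per bucket")
```

Under that assumption, a lookup at shift 12 walks a chain of roughly
seven entries on average, while at shift 16 most buckets hold at most
one entry.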

In addition to raising the limit on larger systems, it looked reasonable
to scale the pid hash with the number of processors instead of memory
size.  While I observed variably high process:cpu ratios on small
systems (2c - 32c), they also have relatively few processes.  The
192c - 2048c systems I was able to look at were all hovering at
13 +/- 2 processes per cpu, even with wildly varying memory sizes.
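
To make the idea concrete, here is one hypothetical way to derive a
shift from the cpu count.  This is not the formula from Robin's patch:
the function name, the clamp bounds, and the target of roughly one
process per bucket are all assumptions for illustration, and the
13-per-cpu figure is only what I saw on the 192c - 2048c systems.

```python
def pidhash_shift_for_cpus(nr_cpus: int, procs_per_cpu: int = 13,
                           min_shift: int = 4, max_shift: int = 16) -> int:
    """Hypothetical heuristic: smallest shift giving >= one bucket per process."""
    expected_procs = max(1, nr_cpus * procs_per_cpu)
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, using integers only
    shift = (expected_procs - 1).bit_length()
    # Clamp to assumed lower and upper bounds
    return max(min_shift, min(max_shift, shift))


for cpus in (2, 192, 2048):
    print(f"{cpus} cpus -> pidhash_shift {pidhash_shift_for_cpus(cpus)}")
```

With these assumed parameters a 2048-cpu system lands at shift 15 and a
192-cpu system at shift 12; small systems would need a different ratio,
since their process:cpu ratios vary much more.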

Despite more recent changes in proc_pid_readdir(), my results should
apply to current source.  It looks like both the old 2.6.16
implementation and the current version will call find_pid() (or
equivalent) once for each successive getdents() call on /proc, except
when the cursor is on the first entry.  A quick look shows 88
getdents64() calls for both 'ps' and 'ls /proc' with 29k processes
running, which appears to be the primary source of find_pid() calls.

It's not giganormous, although I probably could come up with a pointless
microbenchmark to show it's 300% better.  Importantly, it does
noticeably improve normal interactive tools like 'ps' and 'top', and a
performance visualization tool developed by a customer (nodemon)
refreshes faster.  For a 512k init allocation, that seems like a very
good deal.
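
For reference, the 512k figure is consistent with the following
arithmetic.  The sketch assumes a single table with one pointer-sized
head per bucket and 8-byte pointers (a 64-bit system); the function name
is invented for this example.

```python
def pidhash_table_bytes(pidhash_shift: int, bucket_head_bytes: int = 8) -> int:
    """Table size in bytes, ASSUMING one pointer-sized head per bucket."""
    return (1 << pidhash_shift) * bucket_head_bytes


for shift in (12, 16):
    print(f"pidhash_shift={shift}: {pidhash_table_bytes(shift) // 1024} KiB")
```

Under those assumptions, going from shift 12 to shift 16 grows the table
from 32 KiB to 512 KiB.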

I'd like to lose 20,000 kernel processes in addition to growing the pid 
hash!