Message-ID: <Pine.GSO.4.64.0805211115390.8451@westnet.com>
Date: Wed, 21 May 2008 13:34:56 -0400 (EDT)
From: Greg Smith <gsmith@...gsmith.com>
To: lkml <linux-kernel@...r.kernel.org>
Subject: PostgreSQL pgbench performance regression in 2.6.23+

PostgreSQL ships with a simple database benchmarking tool named pgbench,
in what's labeled the contrib section (in many distributions it's a
separate package from the main server/client ones). I see there's been
some work done already improving how the PostgreSQL server works under the
new scheduler (the "Poor PostgreSQL scaling on Linux 2.6.25-rc5" thread).
I wanted to provide you a different test case using pgbench that has taken
a sharp dive starting with 2.6.23, and the server improvement changes in
2.6.25 actually made this problem worse.

I think it will be easy for someone else to replicate my results and I'll
go over the exact procedure below. To start with a view of how bad the
regression is, here's a summary of the results on one system, an AMD X2
4600+ running at 2.4GHz, with a few interesting kernels. I threw in
results from Solaris 10 on this system as a nice independent reference
point. The numbers here are transactions/second (TPS) running a simple
read-only test over a 160MB data set, I took the median from 3 test runs:

Clients   2.6.9  2.6.22  2.6.24  2.6.25  Solaris
      1   11173   11052   10526   10700     9656
      2   18035   16352   14447   10370    14518
      3   19365   15414   17784    9403    14062
      4   18975   14290   16832    8882    14568
      5   18652   14211   16356    8527    15062
      6   17830   13291   16763    9473    15314
      8   15837   12374   15343    9093    15164
     10   14829   11218   10732    9057    14967
     15   14053   11116    7460    7113    13944
     20   13713   11412    7171    7017    13357
     30   13454   11191    7049    6896    12987
     40   13103   11062    7001    6820    12871
     50   12311   11255    6915    6797    12858

That's the CentOS 4 2.6.9 kernel there, while the rest are stock ones I
compiled with a minimum of fiddling from the defaults (just adding support
for my SATA RAID card). You can see a major drop with the recent kernels
at high client loads, and the changes in 2.6.25 seem to have really hurt
even the low client count ones.

The other recent hardware I have here, an Intel Q6600 based system, gives
even more maddening results. On successive benchmark runs, you can watch
it break down, but only intermittently, once you get just above 8 clients.
At 10 and 15 clients, when I run it a few times, I'll sometimes get
results in the good 25-30K TPS range, while other runs give the 10K slow
case. It's not a smooth drop-off like in the AMD case; the results from
10-15 clients are really unstable. I've attached some files with 5 quick
runs at each client load
so you can see what I'm talking about. On that system I was also able to
test 2.6.26-rc2 which doesn't look all that different from 2.6.25.

All these results are running everything on the server using the default
local sockets-based interface, which is relevant in the real world because
that's how a web app hosted on the same system will talk to the database.
If I switch to connecting to the database over TCP/IP and run the pgbench
client on another system, the extra latency drops the single client case
to ~3100TPS. But the high client load cases are great--about 26K TPS at
50 clients. That result is attached as q6600-remote-2.6.25.txt, the
remote client was running 2.6.20. Since recent PostgreSQL results were
also fine with sysbench as the benchmark driver, this suggests the problem
here is actually related to the pgbench client itself and how it gets
scheduled relative to the server backends, rather than being inherent to
the server.

Replicating the test results
----------------------------

On to replicating my results, which I hope works because I don't have too
much time to test potentially fixed kernels myself (I was supposed to be
working on the PostgreSQL code until this sidetracked me). I'll assume
you can get the basic database going, if anybody needs help with that let
me know. There is one server tunable that needs to be adjusted before you
can get useful PostgreSQL benchmarks from this test (and many others).
In the root of the database directory, there will be a file named
postgresql.conf. Edit that and change the setting for the shared_buffers
parameter to 256MB to mimic my test setup. You may need to bump up shmmax
(this is the one list where I'm happy I don't have to explain what that
means!). Restart the server and check the logs to make sure it came back
up; if shmmax is too low, the server will just tell you how big it needs
to be and refuse to start.

Now the basic procedure to run this test is:
-dropdb pgbench (if it's already there)
-createdb pgbench
-pgbench -i -s 10 pgbench (makes about a 160MB database)
-pgbench -S -c <clients> -t 10000 pgbench

The idea is that you'll have a large enough data set to not fit in L2
cache, but small enough that it all fits in PostgreSQL's dedicated memory
(shared_buffers) so that it never has to ask the kernel to read a block.
The "pgbench -i" initialization step will populate the server's memory and
while that's all written to disk, it should stay in memory afterwards as
well. That's why I use this as a general CPU/L2/memory test as viewed
from a PostgreSQL context, and as you can see from my results with this
problem it's pretty sensitive to whether your setup is optimal or not.

To make this easier to run against a range of client loads, I've attached
a script (selecttest.sh) that does the last two steps in the above.
That's what I used to generate all the results I've attached. If you've
got the database set up such that you can run the psql client and pgbench
is in your path, you should just be able to run that script and have it
give you a set of results in a couple of minutes. You can adjust which
client loads and how many times it runs each by editing the script.

Addendum: how pgbench works
----------------------------

pgbench works off "command scripts", which are a series of SQL commands
with some extra benchmarking features implemented as a really simple
programming language. For example, the SELECT-only test run above, what
you get when passing -S to pgbench, is implemented like this:

\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Here :scale is detected automatically by doing a count of a table in the
database.

The pgbench client runs as a single process. When pgbench starts, it
iterates over each client, parsing the script until it hits a line that
needs to be sent to the server. At that point, it issues that command as
an asynchronous request, then returns to the main loop. Once every client
is primed with a command, it enters a loop where it just waits for
responses from them.

The main loop has all the open client connections in an fd_set. Each time
a select() on that set says the server has responded to at least one of
the clients, it sweeps through all the clients and feeds the next script
line to any that are ready for one. This proceeds until the
target transaction count is reached.
This design is recognized as being only useful for smallish client loads.
The results start dropping off very hard even on a fast machine with >100
simulated clients, as the single pgbench process struggles to service
every connection that is ready on each sweep through the clients. This
makes pgbench particularly unsuitable for testing on
systems with a large number of CPUs. I find pgbench just can't keep up
with the useful number of clients possible somewhere between 8 and 16
cores. I'm hoping the PostgreSQL community can rewrite it in a more
efficient way before the next release comes out now that such hardware is
starting to show up more running this database. If that's the only way to
resolve the issue outlined in this message, that's not intolerable, but a
kernel fix would obviously be better.

I wanted to submit this here regardless because I'd really like for
current versions to not have a big regression just because they were using
a newer kernel, and it provides an interesting scheduler test case to add
to the mix. The fact that earlier Linux kernels and alternate ones like
Solaris give pretty consistent results here says this programming approach
isn't impossible for a kernel to support well; I just don't think this
specific type of load has been considered in the test cases for the new
scheduler yet.

--
* Greg Smith gsmith@...gsmith.com http://www.gregsmith.com Baltimore, MD
View attachment "selecttest.sh" of type "TEXT/PLAIN" (503 bytes)
View attachment "q6600-remote-2.6.25.txt" of type "TEXT/PLAIN" (711 bytes)
View attachment "q6600-results-2.6.25.txt" of type "TEXT/PLAIN" (1064 bytes)
View attachment "q6600-results-2.6.24.txt" of type "TEXT/PLAIN" (1069 bytes)
View attachment "q6600-results-2.6.22.txt" of type "TEXT/PLAIN" (1085 bytes)
View attachment "q6600-results-2.6.26-rc2.txt" of type "TEXT/PLAIN" (1084 bytes)