linux-kernel - Re: rq lock contention due to commit af7f588d8f73

Message-ID: <b5e09943-36e6-c89b-4701-5af6408223e8@efficios.com>
Date:   Mon, 27 Mar 2023 09:20:44 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-27 04:05, Aaron Lu wrote:
> Hi Mathieu,
> 
> I was doing some optimization work[1] for the kernel scheduler using a
> database workload (sysbench+postgres). Before submitting it, I rebased
> my patch on top of the latest v6.3-rc kernels to check that everything
> still works as expected, and found that the rq lock had become very
> heavily contended compared to v6.2-based kernels.
> 
> Using the above mentioned workload, before commit af7f588d8f73("sched:
> Introduce per-memory-map concurrency ID"), the profile looked like:
> 
>       7.30%     0.71%  [kernel.vmlinux]            [k] __schedule
>       0.03%     0.03%  [kernel.vmlinux]            [k] native_queued_spin_lock_slowpath
> 
> After that commit:
> 
>      49.01%     0.87%  [kernel.vmlinux]            [k] __schedule
>      43.20%    43.18%  [kernel.vmlinux]            [k] native_queued_spin_lock_slowpath
> 
> The above profile was captured with sysbench's nr_threads set to 56;
> with more threads, the contention gets more severe on that
> 2-socket/112-core/224-CPU Intel Sapphire Rapids server.
> 
> The docker image I used for the optimization work is not available
> publicly, but I managed to reproduce the problem using only publicly
> available components. Here are the steps:
> 1 docker pull postgres
> 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> 3 go inside the container
>    sudo docker exec -it $the_just_started_container_id bash
> 4 install sysbench inside container
>    sudo apt update && sudo apt install sysbench
> 5 prepare
>    root@...tainer:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> 6 run
>    root@...tainer:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
> 
> Let it warm up a little; after 10-20s you can take a profile and see
> the increased rq lock contention. You may need a machine with at
> least 56 CPUs to see this; I didn't try on other machines.
> 
> Feel free to let me know if you need any other info.

While I set up my dev machine with this reproducer, here are a few
questions to help figure out the context:

I understand that pgsql is a multi-process database. Is it strictly
single-threaded per-process, or does each process have more than
one thread?
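
One way to check this on the test machine is to read the "Threads:" field
of /proc/<pid>/status for each server process. The following is a minimal
sketch, not a definitive tool: the function names are hypothetical, and it
assumes the server processes report "postgres" as their Name (comm), which
should be verified on the actual setup.

```python
import re
from pathlib import Path


def threads_from_status(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:[ \t]+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Threads: field found")
    return int(match.group(1))


def name_from_status(status_text):
    """Return the 'Name:' value (the task comm), or '' if it is missing."""
    match = re.search(r"^Name:[ \t]+(.*)$", status_text, re.MULTILINE)
    return match.group(1) if match else ""


def thread_counts(comm="postgres", proc_root="/proc"):
    """Map pid -> thread count for each process whose Name equals comm.

    The default comm of "postgres" is an assumption about the workload.
    """
    counts = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # not a per-process directory
        try:
            text = (entry / "status").read_text()
        except OSError:
            continue  # the process exited or is not readable
        if name_from_status(text) == comm:
            counts[int(entry.name)] = threads_from_status(text)
    return counts


if __name__ == "__main__":
    for pid, nr_threads in sorted(thread_counts().items()):
        print(pid, nr_threads)
```

If every listed pid shows a count of 1, each process is single-threaded.
Run it on the host or inside the container while the benchmark is active.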

I understand that your workload is scheduling between threads which
belong to different processes. Are there more heavily active threads
than there are scheduler runqueues (CPUs) on your machine?
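
A rough way to answer this is to compare the number of currently runnable
scheduling entities, taken from the fourth field of /proc/loadavg, with
the number of CPUs. This is a sketch with hypothetical function names. It
takes a single instantaneous sample, so it assumes you sample it several
times while the benchmark runs, and it counts all runnable tasks on the
system, not only those of the workload.

```python
import os


def runnable_from_loadavg(loadavg_text):
    """Parse /proc/loadavg text and return (runnable, total) entities.

    The fourth field has the form "runnable/total", e.g. "1/80".
    """
    fields = loadavg_text.split()
    runnable, total = fields[3].split("/")
    return int(runnable), int(total)


def oversubscribed(runnable, nr_cpus):
    """True if there are more runnable entities than CPUs (runqueues)."""
    return runnable > nr_cpus


if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        runnable, _total = runnable_from_loadavg(f.read())
    nr_cpus = os.cpu_count()
    print("runnable:", runnable, "cpus:", nr_cpus,
          "oversubscribed:", oversubscribed(runnable, nr_cpus))
```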

When I developed the mm_cid feature, I originally implemented two additional
optimizations:

     Additional optimizations can be done if the spin locks added when
     context switching between threads belonging to different memory maps end
     up being a performance bottleneck. Those are left out of this patch
     though. A performance impact would have to be clearly demonstrated to
     justify the added complexity.

I suspect that your workload demonstrates the need for at least one of those
optimizations. I just wonder if we are in a purely single-threaded scenario
for each process, or if each process has many threads.

Thanks,

Mathieu


> 
> [1]: https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
> 
> Best wishes,
> Aaron

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com