Message-Id: <1553866527-18879-1-git-send-email-jdesfossez@digitalocean.com>
Date:   Fri, 29 Mar 2019 09:35:27 -0400
From:   Julien Desfossez <jdesfossez@...italocean.com>
To:     Subhra Mazumdar <subhra.mazumdar@...cle.com>
Cc:     Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, tim.c.chen@...ux.intel.com,
        torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Vineeth Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>
Subject: Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access

On Fri, Mar 22, 2019 at 8:09 PM Subhra Mazumdar <subhra.mazumdar@...cle.com>
wrote:
> Is the core wide lock primarily responsible for the regression? I ran
> upto patch
> 12 which also has the core wide lock for tagged cgroups and also calls
> newidle_balance() from pick_next_task(). I don't see any regression.  Of
> course
> the core sched version of pick_next_task() may be doing more but
> comparing with
> the __pick_next_task() it doesn't look too horrible.

On further testing and investigation, we also agree that spinlock contention
is not the sole cause of the regression, but we believe it is one of the
major contributing factors to this performance loss.

To narrow the scope of the investigation into the performance regression, we
designed a couple of smaller test cases (compared to big VMs running complex
benchmarks). It turns out the most impacted test case is a simple disk
write-intensive one (up to a 99% performance drop), while CPU-intensive and
scheduler-intensive tests (perf bench sched) behave well.

On the same server we used before (2x18 cores, 72 hardware threads), with
all the non-essential services disabled, we set up a cpuset of 4 cores (8
hardware threads) and ran sysbench fileio on a dedicated drive (no RAID).
With sysbench running with 8 threads in this cpuset without core scheduling,
we get about 155.23 MiB/s in sequential write. If we enable the tag, we drop
to 0.25 MiB/s. Interestingly, even with 4 threads, we see the same kind of
performance drop.
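For reference, a sketch of the cgroup setup we used (the cgroup v1 mount
paths, the exact CPU list, and the cpu.tag file from this patch series are
assumptions; adjust them to your machine and mount points):

```shell
# Create a cgroup in the cpu and cpuset controllers (libcgroup's cgcreate).
cgcreate -g cpu,cpuset:test

# Pin the cgroup to 4 cores / 8 hardware threads. The sibling offset of 36
# assumes a 2x18-core, 72-thread machine with contiguous thread numbering.
echo 0-3,36-39 > /sys/fs/cgroup/cpuset/test/cpuset.cpus
echo 0         > /sys/fs/cgroup/cpuset/test/cpuset.mems

# Enable the core scheduling tag for this cgroup (interface added by this
# RFC series).
echo 1 > /sys/fs/cgroup/cpu/test/cpu.tag
```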

Command used:

sysbench --test=fileio prepare
cgexec -g cpu,cpuset:test sysbench --threads=4 --test=fileio \
--file-test-mode=seqwr run

If we run this with the data in a ramdisk instead of a real drive, we don’t
notice any drop. The magnitude of the drop varies a bit from machine to
machine, but it is always significant.
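The ramdisk variant can be reproduced with something like the following (the
mount point and tmpfs size are assumptions):

```shell
# Back the sysbench test files with a tmpfs instead of a real drive.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=4G tmpfs /mnt/ramdisk
cd /mnt/ramdisk

# Same workload as above, now hitting memory instead of the disk.
sysbench --test=fileio prepare
cgexec -g cpu,cpuset:test sysbench --threads=4 --test=fileio \
    --file-test-mode=seqwr run
```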

We spent a lot of time looking at the traces and noticed that, a couple of
times during every run, the sysbench worker threads wait for IO, sometimes
for up to 4 seconds; all the threads wait for the same duration, and during
that time we don’t see any block-related softirq coming in. As soon as the
interrupt is processed, sysbench gets woken up immediately. This long wait
never happens without core scheduling. So we are trying to find a place
where interrupts are disabled for an extended period of time, but the
irqsoff tracer doesn’t seem to pick it up.
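For completeness, this is the standard way we have been running the irqsoff
tracer (assuming debugfs is mounted at /sys/kernel/debug):

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo irqsoff > current_tracer
echo 0 > tracing_max_latency      # reset the recorded maximum
echo 1 > tracing_on
# ... run the sysbench workload here ...
echo 0 > tracing_on
cat tracing_max_latency           # worst irqs-off latency seen, in usecs
cat trace                         # where the longest irqs-off section was
```

With a 4-second stall we would expect tracing_max_latency to be enormous, so
if it stays small the wait is probably not a single long irqs-off section.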

Any thoughts about that?
