Message-ID: <CY4PR04MB3751B1FBF2513CCF35B48A75E7690@CY4PR04MB3751.namprd04.prod.outlook.com>
Date: Mon, 6 Jul 2020 02:57:37 +0000
From: Damien Le Moal <Damien.LeMoal@....com>
To: Nobuto Murata <nobuto.murata@...onical.com>,
Ignat Korchagin <ignat@...udflare.com>,
"agk@...hat.com" <agk@...hat.com>,
"snitzer@...hat.com" <snitzer@...hat.com>,
"dm-devel@...hat.com" <dm-devel@...hat.com>,
"dm-crypt@...ut.de" <dm-crypt@...ut.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "herbert@...dor.apana.org.au" <herbert@...dor.apana.org.au>,
"kernel-team@...udflare.com" <kernel-team@...udflare.com>,
"ebiggers@...nel.org" <ebiggers@...nel.org>,
"mpatocka@...hat.com" <mpatocka@...hat.com>
Subject: Re: [dm-devel] [PATCH v2] dm crypt: add flags to optionally bypass
dm-crypt workqueues
On 2020/07/06 11:28, Nobuto Murata wrote:
> On 6/27/20 6:03 AM, Ignat Korchagin wrote:
>> This is a follow up from [1]. Consider the following script:
>>
>> sudo modprobe brd rd_nr=1 rd_size=4194304
>>
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0' | \
>> sudo dmsetup create eram0
>>
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_write_workqueue' | \
>> sudo dmsetup create eram0-inline-write
>>
>> echo '0 8388608 crypt capi:ecb(cipher_null) - 0 /dev/ram0 0 1 no_read_workqueue' | \
>> sudo dmsetup create eram0-inline-read
>>
>> devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
>> devices+="/dev/mapper/eram0-inline-write"
>>
>> for dev in $devices; do
>> echo "reading from $dev"
>> sudo fio --filename=$dev --readwrite=read --bs=4k --direct=1 \
>> --loops=1000000 --runtime=3m --name=plain | grep READ
>> done
>>
>> for dev in $devices; do
>> echo "writing to $dev"
>> sudo fio --filename=$dev --readwrite=write --bs=4k --direct=1 \
>> --loops=1000000 --runtime=3m --name=plain | grep WRITE
>> done
>>
>> This script creates a ramdisk (to eliminate hardware bias in the benchmark) and
>> three dm-crypt instances on top. All dm-crypt instances use the NULL cipher
>> to eliminate potentially expensive crypto bias (the NULL cipher just uses memcpy
>> for "encyption"). The first instance is the current dm-crypt implementation from
>> 5.8-rc2, the two others have new optional flags enabled, which bypass kcryptd
>> workqueues for reads and writes respectively and write sorting for writes. On
>> my VM (Debian in VirtualBox with 4 cores on 2.8 GHz Quad-Core Intel Core i7) I
>> get the following output (formatted for better readability):
>>
>> reading from /dev/ram0
>> READ: bw=508MiB/s (533MB/s), 508MiB/s-508MiB/s (533MB/s-533MB/s), io=89.3GiB (95.9GB), run=180000-180000msec
>>
>> reading from /dev/mapper/eram0
>> READ: bw=80.6MiB/s (84.5MB/s), 80.6MiB/s-80.6MiB/s (84.5MB/s-84.5MB/s), io=14.2GiB (15.2GB), run=180000-180000msec
>>
>> reading from /dev/mapper/eram0-inline-read
>> READ: bw=295MiB/s (309MB/s), 295MiB/s-295MiB/s (309MB/s-309MB/s), io=51.8GiB (55.6GB), run=180000-180000msec
>>
>> reading from /dev/mapper/eram0-inline-write
>> READ: bw=114MiB/s (120MB/s), 114MiB/s-114MiB/s (120MB/s-120MB/s), io=20.1GiB (21.5GB), run=180000-180000msec
>>
>> writing to /dev/ram0
>> WRITE: bw=516MiB/s (541MB/s), 516MiB/s-516MiB/s (541MB/s-541MB/s), io=90.7GiB (97.4GB), run=180001-180001msec
>>
>> writing to /dev/mapper/eram0
>> WRITE: bw=40.4MiB/s (42.4MB/s), 40.4MiB/s-40.4MiB/s (42.4MB/s-42.4MB/s), io=7271MiB (7624MB), run=180001-180001msec
>>
>> writing to /dev/mapper/eram0-inline-read
>> WRITE: bw=38.9MiB/s (40.8MB/s), 38.9MiB/s-38.9MiB/s (40.8MB/s-40.8MB/s), io=7000MiB (7340MB), run=180001-180001msec
>>
>> writing to /dev/mapper/eram0-inline-write
>> WRITE: bw=277MiB/s (290MB/s), 277MiB/s-277MiB/s (290MB/s-290MB/s), io=48.6GiB (52.2GB), run=180000-180000msec
>
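As a side note on the quoted setup: whether a mapping actually carries the
optional flags can be double-checked from its live table. A minimal check,
assuming the device names from the script above (the exact output formatting
is only an illustration):

sudo dmsetup table eram0-inline-read
# the printed table line should end with the optional argument count and
# flag, e.g. "... 0 1 no_read_workqueue"; the backing device is shown as
# major:minor rather than /dev/ram0

The same applies to eram0-inline-write with no_write_workqueue.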
> I've applied this v2 patch on top of Ubuntu 5.4 kernel and followed the
> test case on my personal testbed (bare metal) with AMD Ryzen 7 2700.
> Indeed it made things faster in general, but I saw mixed results when
> running it with some scenarios close to the actual workloads. I went
> back to the test case and ran it with various blocksize, iodepth, and
> numjobs, then I saw similar and mixed trends. One of the cases where
> no_(read|write)_workqueue doesn't perform well is as follows:
>
> ##################################################
> # 4m ioengine=libaio iodepth=16 numjobs=1
> ##################################################
> reading from /dev/ram0
> READ: bw=6208MiB/s (6510MB/s)
> reading from /dev/mapper/eram0
> READ: bw=4773MiB/s (5005MB/s)
> reading from /dev/mapper/eram0-inline-read
> READ: bw=2782MiB/s (2918MB/s)
> reading from /dev/mapper/eram0-inline-write
> READ: bw=4757MiB/s (4988MB/s)
> writing to /dev/ram0
> WRITE: bw=5497MiB/s (5764MB/s)
> writing to /dev/mapper/eram0
> WRITE: bw=3143MiB/s (3296MB/s)
> writing to /dev/mapper/eram0-inline-read
> WRITE: bw=3144MiB/s (3297MB/s)
> writing to /dev/mapper/eram0-inline-write
> WRITE: bw=1818MiB/s (1906MB/s)
>
> I've collected the output in:
> https://gist.github.com/nobuto-m/74849fb181d9016ddde17c806dac2421#file-patch-v2-testcase_result-txt
I cannot open this link: certificate problems with the https connection...
> Do you see similar trends when blocksize, iodepth, and numjobs are
> changed? I wondered if I should test it again properly with the latest
> kernel (5.8-rcX) instead of 5.4.
We did some tests with the previous version of this patch and also observed
variations based on IO size, queue depth and number of jobs. This is all because
of changes in how CPU resources (crypto HW accelerators) are used. Under high
queue-depth/large IO size workloads, using the workqueues tends to achieve better
overall CPU utilization and so improves overall performance (throughput, not
latency).
See the attached pictures for our results (using 12x HDDs on a dual-socket Xeon
system, 20 cores/40 CPUs). The two graphs are heat maps showing the percentage
difference in throughput with the patch applied vs unmodified dm-crypt (higher
than 100 means better with the patch applied).
Bypassing the workqueues is very efficient for small IO sizes and a large number
of jobs, as these jobs all get scheduled on different CPUs, leading to parallel
use of the HW crypto accelerators without the added overhead of workqueue
context switches. For a low number of jobs and a high queue depth, this
parallelism disappears and performance suffers. For large IO sizes, the same
pattern exists, but performance also suffers for a large number of jobs. While we
did not analyze these results in depth, I think the loss comes from inefficient
use of the HW crypto accelerator resources, that is, time lost forcing work to be
processed in the submitting context while another context (i.e. another crypto
accelerator) has already finished and sits idle. The same significant loss of
performance happens with a low number of jobs and a high queue depth, for the
same reason as for small IOs.
These tests used actual crypto HW, with aes-xts on Intel; results may vary
depending on the cipher used.
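If anyone wants to reproduce the dependence on these parameters with the ramdisk
setup from the quoted script, a sweep along the following lines should do it.
This is only a sketch: the parameter ranges and the 30 second runtime are
illustrative and not the values used in either of the tests above.

devices="/dev/ram0 /dev/mapper/eram0 /dev/mapper/eram0-inline-read "
devices+="/dev/mapper/eram0-inline-write"

for bs in 4k 64k 4m; do
  for iodepth in 1 16 64; do
    for numjobs in 1 4 16; do
      for dev in $devices; do
        echo "bs=$bs iodepth=$iodepth numjobs=$numjobs dev=$dev"
        sudo fio --filename=$dev --readwrite=read --bs=$bs --direct=1 \
            --ioengine=libaio --iodepth=$iodepth --numjobs=$numjobs \
            --group_reporting --time_based --runtime=30s --name=plain \
            | grep READ
      done
    done
  done
done

The same loop with --readwrite=write and grep WRITE covers the write side.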
--
Damien Le Moal
Western Digital Research
[Attachment: "dm-crypt-sync.png" (image/png, 337194 bytes)]