Message-ID: <CAMz4kuLTuY5CpTfW4Gijesy1Pp9PxbWj076XeoV24Hj3EydifQ@mail.gmail.com>
Date:	Wed, 2 Dec 2015 20:46:54 +0800
From:	Baolin Wang <baolin.wang@...aro.org>
To:	Mark Brown <broonie@...nel.org>
Cc:	Jens Axboe <axboe@...nel.dk>, Mike Snitzer <snitzer@...hat.com>,
	Alasdair G Kergon <agk@...hat.com>, dm-devel@...hat.com,
	neilb@...e.com, linux-raid@...r.kernel.org,
	Jan Kara <jack@...e.cz>, Arnd Bergmann <arnd@...db.de>,
	LKML <linux-kernel@...r.kernel.org>, keith.busch@...el.com,
	jmoyer@...hat.com, tj@...nel.org, bart.vanassche@...disk.com,
	"Garg, Dinesh" <dineshg@...cinc.com>
Subject: Re: [PATCH 0/2] Introduce the request handling for dm-crypt

Hi All,

Here are the benchmark results for request-based dm-crypt; please take a look.

I. Environment
1. Hardware configuration
Board: BeagleBone Black
Processor: AM335x 1GHz ARM Cortex-A8
RAM: 512MB
SD card: 8GB
Kernel version: 4.4-rc1

2. Encryption method
(1) Use cbc(aes) cipher to encrypt the block device with dmsetup tool
dmsetup create dm-0 --table "0 `blockdev --getsize /dev/mmcblk0p1`
crypt aes-cbc-plain:sha256 babebabebabebabebabebabebabebabe 0
/dev/mmcblk0p1 0"
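
To confirm the mapping was created as intended, the table and status can
be read back; a quick sanity check, not part of the measurements:

# Read back the loaded crypt target line
dmsetup table dm-0
# Verify the mapped device state is ACTIVE
dmsetup info dm-0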

(2) Enable the OMAP AES hardware engine with these config options:
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_OMAP_AES=y
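
Whether cbc(aes) is actually backed by the OMAP engine can be checked in
/proc/crypto; we would expect a hardware driver entry (something like
"cbc-aes-omap" -- an assumption worth verifying on the target) listed
with higher priority than the generic software implementation:

# List the registered cbc(aes) implementations with driver and priority
grep -B1 -A5 'cbc(aes)' /proc/crypto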

(3) Limitation
We wanted to test on a ramdisk rather than on slow media (the SD card)
first, but a ramdisk cannot be mapped with the dmsetup tool here: the
ramdisk driver is not a request-stackable device, so it cannot be used
with request-based dm.

II. Result summary
1. Results table
----------------------------------------------------------------------
| Test                 | Size | bio based | request based | % change |
----------------------------------------------------------------------
| dd sequential read   | 1G   | 5.6MB/s   | 11.3MB/s      | +101.8%  |
----------------------------------------------------------------------
| dd sequential write  | 1G   | 4.2MB/s   | 6.8MB/s       | +61.9%   |
----------------------------------------------------------------------
| fio sequential read  | 1G   | 5336KB/s  | 10928KB/s     | +104.8%  |
----------------------------------------------------------------------
| fio sequential write | 1G   | 4049KB/s  | 6574KB/s      | +62.4%   |
----------------------------------------------------------------------

2. Summary
The dd and fio results are consistent across the two tools, so together
they give a reliable picture of the I/O performance impact of the
request-based optimization.

Read speed benefits the most: it at least doubles when the request-based
optimization is enabled.

Write speed also improves substantially, by roughly 60% with the
request-based optimization. Random writes, however, show little
difference, as they are limited by the hardware's slow random access.

III. DD test procedure
dd can be used for simple low-level copying, operating directly on the
raw devices. It provides good basic coverage but is not very realistic,
and it only generates sequential I/O. It does, however, let us read and
write the raw devices without interference from filesystem caching. The
test results are below:
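
The iflag=direct/oflag=direct options already bypass the page cache, but
to be safe the caches can also be flushed and dropped between runs
(standard procedure, nothing dm-crypt specific):

# Flush dirty pages, then drop the page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches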

1. Sequential read:
(1) Sequential read 1G with bio based:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 192.091 s, 5.6 MB/s
real    3m12.112s
user    0m0.070s
sys     0m3.820s

(2) Sequential read 1G with request based:
time dd if=/dev/dm-0 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 94.8922 s, 11.3 MB/s
real    1m34.908s
user    0m0.030s
sys     0m4.000s

(3) Sequential read 1G without encryption:
time dd if=/dev/mmcblk0p1 of=/dev/null bs=64K count=16384 iflag=direct
1073741824 bytes (1.1 GB) copied, 58.49 s, 18.4 MB/s
real    0m58.505s
user    0m0.040s
sys     0m3.050s

2. Sequential write:
(1) Sequential write 1G with bio based:
time dd if=/dev/zero of=/dev/dm-0 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 253.477 s, 4.2 MB/s
real    4m13.497s
user    0m0.130s
sys     0m3.990s

(2) Sequential write 1G with request based:
time dd if=/dev/zero of=/dev/dm-0 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 157.396 s, 6.8 MB/s
real    2m37.414s
user    0m0.130s
sys     0m4.190s

(3) Sequential write 1G without encryption:
time dd if=/dev/zero of=/dev/mmcblk0p1 bs=64K count=16384 oflag=direct
1073741824 bytes (1.1 GB) copied, 120.452 s, 8.9 MB/s
real    2m0.471s
user    0m0.050s
sys     0m3.820s

3. Summary:
With bio-based dm-crypt the sequential read/write speeds are 5.6MB/s and
4.2MB/s; with request-based dm-crypt they rise to 11.3MB/s and 6.8MB/s.
That is an increase of (11.3 - 5.6) / 5.6 = 101.8% for reads and
(6.8 - 4.2) / 4.2 = 61.9% for writes. The 'sys' times also differ
noticeably with the request-based optimizations.

IV. Fio test procedure
We set the block size to 64K; the command looks like:
fio --filename=/dev/dm-0 --direct=1 --iodepth=1 --rw=read --bs=64K
--size=1G --group_reporting --numjobs=1 --name=test_read
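
The write test is the same command with the direction flipped (the job
name test_write is just our label):

fio --filename=/dev/dm-0 --direct=1 --iodepth=1 --rw=write --bs=64K
--size=1G --group_reporting --numjobs=1 --name=test_write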

1. Sequential read 1G with bio based:
READ: io=1024.0MB, aggrb=5336KB/s, minb=5336KB/s, maxb=5336KB/s,
mint=196494msec, maxt=196494msec

2. Sequential write 1G with bio based:
WRITE: io=1024.0MB, aggrb=4049KB/s, minb=4049KB/s, maxb=4049KB/s,
mint=258954msec, maxt=258954msec

3. Sequential read 1G with request based:
READ: io=1024.0MB, aggrb=10928KB/s, minb=10928KB/s, maxb=10928KB/s,
mint=95947msec, maxt=95947msec

4. Sequential write 1G with request based:
WRITE: io=1024.0MB, aggrb=6574KB/s, minb=6574KB/s, maxb=6574KB/s,
mint=159493msec, maxt=159493msec

5. Summary:
(1) Read:
The sequential read speed shows a big improvement with the request-based
changes: it increases by 104.8% when they are enabled for dm-crypt. A
read is not truly random once we fix the block size, so random read
speeds are not listed here, although they also showed big improvements.

(2) Write:
The sequential write speed also improves with the request-based changes,
increasing by about 62.4%. Random write is hard to measure meaningfully
on an SD card, though, because any random write smaller than the
underlying block size eventually causes long I/O latencies, which masks
the improvement.
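
For reference, the random-I/O variants that we did not tabulate (for the
reasons above) only change the --rw argument; the job names are ours:

fio --filename=/dev/dm-0 --direct=1 --iodepth=1 --rw=randread --bs=64K
--size=1G --group_reporting --numjobs=1 --name=test_randread
fio --filename=/dev/dm-0 --direct=1 --iodepth=1 --rw=randwrite --bs=64K
--size=1G --group_reporting --numjobs=1 --name=test_randwrite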

V. IO block size test
We also varied the block size from 4K to 1M (most I/O block sizes in
practice are much smaller than 1M) to see how the block size influences
request-based dm-crypt.

1. Sequential read 1G
(1) block size = 4k
time dd if=/dev/dm-0 of=/dev/null bs=4k count=262144 iflag=direct
1073741824 bytes (1.1 GB) copied, 310.598 s, 3.5 MB/s
real    5m10.614s
user    0m0.610s
sys     0m36.040s

(2) block size = 64k
1073741824 bytes (1.1 GB) copied, 95.0489 s, 11.3 MB/s
real    1m35.071s
user    0m0.040s
sys     0m4.030s

(3) block size = 256k
1073741824 bytes (1.1 GB) copied, 84.3311 s, 12.7 MB/s
real    1m24.347s
user    0m0.050s
sys     0m1.950s

(4) block size = 1M
1073741824 bytes (1.1 GB) copied, 80.8778 s, 13.3 MB/s
real    1m20.893s
user    0m0.010s
sys     0m1.390s

2. Sequential write 1G
(1) block size = 4k
time dd if=/dev/zero of=/dev/dm-0 bs=4K count=262144 oflag=direct
1073741824 bytes (1.1 GB) copied, 629.656 s, 1.7 MB/s
real    10m29.671s
user    0m0.790s
sys     0m33.550s

(2) block size = 64k
1073741824 bytes (1.1 GB) copied, 155.697 s, 6.9 MB/s
real    2m35.713s
user    0m0.040s
sys     0m4.110s

(3) block size = 256k
1073741824 bytes (1.1 GB) copied, 143.682 s, 7.5 MB/s
real    2m23.698s
user    0m0.040s
sys     0m2.500s

(4) block size = 1M
1073741824 bytes (1.1 GB) copied, 140.654 s, 7.6 MB/s
real    2m20.670s
user    0m0.040s
sys     0m2.090s
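
The runs above can be reproduced with a small loop; a minimal sketch of
the read side (the write side swaps in if=/dev/zero of=/dev/dm-0
oflag=direct):

#!/bin/sh
# Sweep dd block sizes over the mapped device, 1G of data per run
for bs in 4 64 256 1024; do            # block size in KB
        count=$((1048576 / bs))        # 1G = 1048576 KB
        echo "bs=${bs}K count=${count}"
        time dd if=/dev/dm-0 of=/dev/null bs=${bs}K count=${count} iflag=direct
done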

3. Summary
With the request-based changes, consecutive sequential bios/requests can
be merged into one request, expanding the I/O size so the hardware
engine handles a bigger block in a single pass. Since the hardware
acceleration speeds up encryption/decryption most on large blocks, the
engine performs best with large block sizes.
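
The merging can be observed while a test runs: iostat -x reports merged
requests per second in the rrqm/s and wrqm/s columns (this assumes the
sysstat tools are on the board and that the dm queue exposes merge
accounting):

# Watch per-second merge rates and average request size on the dm device
iostat -x 1 dm-0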

From the data we can also see that the read/write speed increases as the
block size grows, which means small block sizes do not help much. Above
64K, however, the speed no longer gains proportionally. I think the
limitation there is in the crypto layer, which keeps bigger bios from
seeing similar benefits; this needs more investigation.
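
One way to chase that suspicion is to benchmark the cipher alone,
without the block layer, using the kernel's tcrypt test module (assuming
CONFIG_CRYPTO_TEST=m is available; mode=200 runs the AES speed tests
over a range of buffer sizes):

# The modprobe is expected to "fail" after running; results go to dmesg
modprobe tcrypt mode=200 sec=1
dmesg | tail -n 50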

On 13 November 2015 at 19:51, Mark Brown <broonie@...nel.org> wrote:
> On Thu, Nov 12, 2015 at 08:26:26AM -0700, Jens Axboe wrote:
>> On 11/12/2015 03:04 AM, Mark Brown wrote:
>
>> >Android now wants to encrypt phones and tablets by default and have been
>> >seeing substantial performance hits as a result, we can try to get
>> >people to share performance data from productionish systems but it might
>> >be difficult.
>
>> Well, shame on them for developing out-of-tree, looks like they are reaping
>> all the benefits of that.
>
>> Guys, we need some numbers, enough with the hand waving. There's no point
>> discussing this further until we know how much of a difference it makes to
>> handle X MB chunks instead of Y MB chunks. As was previously stated, unless
>> there's a _substantial_ performance benefit, this patchset isn't going
>> anywhere.
>
> Yeah, what I'm saying here is that there will issues getting the numbers
> from relevant production systems - we are most likely to be looking at
> proxies which are hopefully reasonably representative but there's likely
> to be more divergence than you'd see just running benchmark workloads on
> similar systems to those used in production.



-- 
Baolin.wang
Best Regards
