linux-kernel - ISCSI target engine core LIO(in kernel) performance bottle neck analyze

Message-ID: <5507F065.1040404@suse.com>
Date:	Tue, 17 Mar 2015 17:14:13 +0800
From:	Zhu Lingshan <LSZhu@...e.com>
To:	linux-kernel@...r.kernel.org
Subject: ISCSI target engine core LIO(in kernel) performance bottle neck analyze


Hi,

I have been working on LIO performance for several weeks and can now
share some results and open issues. In this mail I would like to discuss
CPU usage and throughput. I would really appreciate any hints and
suggestions from you!

Summary:
(1) With a 512-byte block size, a single process, and a read workload,
throughput is about 2.9MB/s (fio reports bw=2897.8KB/s) on a 1GB
network. The one busy core on the initiator side spends over 80% of its
cycles in wait, while one core on the LIO side spends 43.6% in Sys, with
no cycles in user or wait. I assume the bottleneck for this small-packet,
single-thread case is lock operations on the LIO target side.

(2) With a 512-byte block size, 32 processes, and a read workload,
throughput is 11.259MB/s on a 1GB network. Only one CPU core on the LIO
target side is busy, and its load is 100% in Sys, while the other cores
are completely idle. I assume the bottleneck for this small-packet,
multi-thread case is that the workload is not balanced across cores on
the target side.

------------------------------------------------------------------------
Here is the detailed information:


My environment:
Two blade servers with E5 CPUs and 32GB RAM; one runs LIO and the other
is the initiator.
iSCSI backstore: RAM disk, created with the command line "modprobe brd
rd_size=4200000 max_part=1 rd_nr=1" (/dev/ram0 on the target; it appears
as /dev/sdc on the initiator side).
1GB network.
OS: SUSE Linux Enterprise Server on both sides, kernel version 3.12.28-4.
Initiator: Open-iSCSI Initiator 2.0873-20.4
LIO-utils: version: 4.1-14.6
My tools: perf, netperf, nmon, FIO


------------------------------------------------------------------------
For case (1):

With a 512-byte block size, a single process, and a read workload,
throughput is 2.897MB/s on a 1GB network. The one busy core on the
initiator side spends over 80% of its cycles in wait, while one core on
the LIO side spends 43.6% in Sys, with no cycles in user or wait.

I ran this test case with the command line:
fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=2G -numjobs=1 -runtime=600 -group_reporting -name=test

Part of the results:
Jobs: 1 (f=1): [R(1)] [100.0% done] [2818KB/0KB/0KB /s] [5636/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=1258: Mon Mar 16 21:48:14 2015
   read : io=262144KB, bw=2897.8KB/s, iops=5795, runt= 90464msec
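
As a quick sanity check on these figures, the bandwidth can be re-derived
from the IOPS and the block size, and the IOPS can be turned into an
average time per I/O. This is only a sketch: it assumes fio's "KB" means
1024 bytes and that a single job with the default synchronous engine keeps
one I/O in flight at a time, which the command line suggests but does not
state.

```python
# Values copied from the fio summary line above.
BLOCK_SIZE_BYTES = 512
IOPS = 5795
REPORTED_BW_KB_S = 2897.8  # assumption: fio "KB" is 1024 bytes


def bandwidth_kb_s(iops, block_size_bytes):
    """Bandwidth implied by an I/O rate and a fixed block size."""
    return iops * block_size_bytes / 1024


def per_io_time_us(iops):
    """Average time per I/O, assuming only one I/O is in flight."""
    return 1_000_000 / iops


bw = bandwidth_kb_s(IOPS, BLOCK_SIZE_BYTES)
io_time = per_io_time_us(IOPS)

print(f"derived bandwidth: {bw:.1f} KB/s (fio reported {REPORTED_BW_KB_S})")
print(f"average time per I/O: {io_time:.1f} us")
```

Under those assumptions each 512-byte read takes roughly 170 microseconds
end to end, so the single-job result is bounded by per-I/O latency rather
than by link bandwidth.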

I ran a netperf test with the buffer set to 512 bytes and 512 bytes per
packet and got a throughput of 6.5MB/s, better than what LIO achieved,
so I used nmon and perf to find out why.
This is a screenshot of what nmon shows for the CPUs on the initiator side:


nmon-14i    Hostname=INIT    Refresh=10secs    21:30.42
CPU Utilisation
---------------------------+-------------------------------------------------+
CPU  User%  Sys% Wait%  Idle|0          |25         |50          |75       100|
  1   0.0   0.0   0.2  99.8|>
  2   0.1   0.1   0.0  99.8|>
  3   0.0   0.2   0.0  99.8|>
  4   0.0   0.0   0.0 100.0|>
  5   0.0   0.0   0.0 100.0|>
  6   0.0   3.1   0.0  96.9|s>
  7   2.8  12.2  83.8   1.2|UssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  8   0.0   0.0   0.0 100.0|>
  9   0.0   0.0   0.0 100.0|>
 10   0.0   0.0   0.0 100.0|>
 11   0.0   0.0   0.0 100.0|>
 12   0.0   0.0   0.0 100.0|>
---------------------------+-------------------------------------------------+
Avg   0.2   1.1   5.8  92.8|WW>
---------------------------+-------------------------------------------------+

On the initiator side only one core is busy, which is expected, but that
core spends 83.8% of its time in wait, which seems strange. On the LIO
target side, the only busy core spends 43.6% in Sys, with no cycles in
user or wait. Why does the initiator wait while the target side still
has free resources (CPU cycles)? I then used perf record to monitor the
LIO target and found that locks, especially spin locks, consume nearly
40% of the CPU cycles. I assume this is why the initiator side shows
wait and low throughput: lock operations are the bottleneck in this case
(small packets, single-thread transactions). Do you have any comments on
that?
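
To put the netperf and fio numbers on the same scale, both can be
converted into an average time per 512-byte message. This is a rough
sketch with two assumptions that are not stated above: that netperf's
"MB" is 10^6 bytes, and that the netperf figure is comparable per message
(if it came from a streaming test rather than a request/response test,
the gap below overstates the iSCSI overhead).

```python
MESSAGE_BYTES = 512


def time_per_message_us(throughput_bytes_per_s, message_bytes=MESSAGE_BYTES):
    """Average time spent per message at a given throughput."""
    return 1_000_000 * message_bytes / throughput_bytes_per_s


# fio: bw=2897.8KB/s, assuming KB = 1024 bytes
fio_us = time_per_message_us(2897.8 * 1024)
# netperf: 6.5MB/s, assuming MB = 10^6 bytes
netperf_us = time_per_message_us(6.5 * 1_000_000)
# Time per I/O that the plain network test does not account for
extra_us = fio_us - netperf_us

print(f"fio:     {fio_us:.1f} us per 512-byte read")
print(f"netperf: {netperf_us:.1f} us per 512-byte message")
print(f"gap:     {extra_us:.1f} us per I/O")
```

Under these assumptions, somewhat more than half of each I/O's time is
spent outside the raw network path, that is, in the iSCSI and SCSI layers
on the two sides.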

------------------------------------------------------------------------

For case (2):
With a 512-byte block size, 32 processes, and a read workload,
throughput is 11.259MB/s on a 1GB network. Only one CPU core on the LIO
target side is busy, and its load is 100% in Sys, while the other cores
are completely idle.
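
One way to double-check the per-core picture on the target without nmon
is to sample /proc/stat twice and compare the per-CPU counters. The
sketch below is illustrative only: the function names are made up for
this mail, and it reports system, irq+softirq, and iowait time
separately, which may not match exactly how nmon groups them.

```python
def parse_proc_stat(text):
    """Return {cpu_name: [counters...]} for the per-CPU lines of /proc/stat."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-CPU lines look like:
        #   cpu0 user nice system idle iowait irq softirq steal ...
        # The aggregate "cpu" line is skipped.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            counters[fields[0]] = [int(v) for v in fields[1:]]
    return counters


def busy_breakdown(before, after):
    """Per-CPU percentage of elapsed time in sys, irq+softirq and iowait."""
    result = {}
    for cpu, new in after.items():
        old = before[cpu]
        # Use user..steal only; guest time is already included in user.
        delta = [n - o for n, o in zip(new[:8], old[:8])]
        total = sum(delta) or 1
        result[cpu] = {
            "sys": 100.0 * delta[2] / total,
            "irq": 100.0 * (delta[5] + delta[6]) / total,
            "wait": 100.0 * delta[4] / total,
        }
    return result


# Usage on the target while fio is running (not executed here):
#   import time
#   before = parse_proc_stat(open("/proc/stat").read())
#   time.sleep(10)
#   after = parse_proc_stat(open("/proc/stat").read())
#   for cpu, pct in busy_breakdown(before, after).items():
#       print(cpu, pct)
```

If a single core shows all of the sys or irq time while the others stay
near zero, that matches the behaviour described above.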

I ran this case with the command line:
fio -filename=/dev/sdc -direct=1 -rw=read -bs=512 -size=4GB -numjobs=32 -runtime=600 -group_reporting -name=test

The throughput is 11.259MB/s. On the LIO target side only one CPU core
is busy and all other cores are completely idle. It seems that nothing
balances the workload across cores, and that this is the bottleneck in
this case (small packets, multi-thread transactions). Would it make
sense to add some code that spreads the transaction traffic across all
cores? I hope to get some hints, suggestions, and explanations from you
experts!
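
The two cases can also be compared in terms of target CPU time per I/O.
This is a back-of-the-envelope sketch, assuming that 11.259MB/s
corresponds to fio reporting 11259KB/s with KB = 1024 bytes, and that the
Sys percentages above are averages over the whole run.

```python
def iops_from_kb_s(kb_s, block_bytes=512):
    """I/O rate implied by a bandwidth in KB/s (KB = 1024 bytes)."""
    return kb_s * 1024 / block_bytes


def cpu_us_per_io(busy_fraction, iops):
    """CPU time one core spends per I/O at a given busy fraction."""
    return busy_fraction * 1_000_000 / iops


single_iops = 5795                  # case (1), from the fio output
multi_iops = iops_from_kb_s(11259)  # case (2), assumed 11259KB/s

speedup = multi_iops / single_iops
single_cpu_us = cpu_us_per_io(0.436, single_iops)  # 43.6% Sys on one core
multi_cpu_us = cpu_us_per_io(1.0, multi_iops)      # 100% Sys on one core

print(f"case (2) IOPS: {multi_iops:.0f} ({speedup:.2f}x case (1))")
print(f"target CPU per I/O, case (1): {single_cpu_us:.1f} us")
print(f"target CPU per I/O, case (2): {multi_cpu_us:.1f} us")
```

With these assumptions, 32 jobs give a bit under four times the
single-job rate, and the one saturated core caps the target at roughly
22,500 I/Os per second no matter how many jobs the initiator runs.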



Thanks a lot for your time to read my mail.
Have a nice day!
BR
Zhu Lingshan



