linux-kernel - Re: [PATCH v30/2] nvmet: use unbound

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20240802033952.125753-1-jacky_gam_2001@163.com>
Date: Fri,  2 Aug 2024 11:39:51 +0800
From: Ping Gan <jacky_gam_2001@....com>
To: sagi@...mberg.me,
	hare@...e.de,
	hch@....de
Cc: ping.gan@...l.com,
	kch@...dia.com,
	linux-nvme@...ts.infradead.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v30/2] nvmet: use unbound_wq for RDMA and TCP by default 

> On 31/07/2024 10:03, Ping Gan wrote:
>>> On 26/07/2024 5:34, Ping Gan wrote:
>>>>> On 19/07/2024 12:19, Ping Gan wrote:
>>>>>> When running nvmf on SMP platform, current nvme target's RDMA and
>>>>>> TCP use bounded workqueue to handle IO, but when there is other
>>>>>> high
>>>>>> workload on the system(eg: kubernetes), the competition between
>>>>>> the
>>>>>> bounded kworker and other workload is very radical. To decrease
>>>>>> the
>>>>>> resource race of OS among them, this patchset will switch to
>>>>>> unbounded
>>>>>> workqueue for nvmet-rdma and nvmet-tcp; besides that, it can also
>>>>>> get some performance improvement. And this patchset bases on
>>>>>> previous
>>>>>> discussion from below session.
>>>>>>
>>>>>> https://lore.kernel.org/lkml/20240719084953.8050-1-jacky_gam_2001@163.com/
>>>>> Hold your horses.
>>>>>
>>>>> This cannot be just switched without a thorough testing and actual
>>>>> justification/proof of
>>>>> a benefit beyond just a narrow use-case brought initially by Ping
>>>>> Gan.
>>>>>
>>>>> If the ask is to universally use an unbound workqueue, please
>>>>> provide
>>>>> detailed
>>>>> benchmarking convincing us that this makes sense.
>>>> So you think we should not do a radical change for the narrow
>>>> usecase
>>>> but
>>>> keep the parameter to enable it in previous version patch, right?
>>> What I'm saying is that if you want to change the default, please
>>> provide
>>> justification in the form of benchmarks that support the change.
>>> This
>>> benchmarks should include both throughput, iops and latency
>>> measurements
>>> and without the cpu-set constraints you presented originally.
>> We tested it on our testbed which has 4 numa 96 cores, 190GB memory
>> and 24 nvme disks, it seems unbound_wq has pretty improvment.  The
>> creating target test script is below:
>>
>> #!/bin/bash
>> if [ "$#" -ne 3 ] ; then
>> echo "$0 addr_trtype(tcp/rdma) target_IP target_port"
>> exit -1
>> fi
>> addr_trtype=$1
>> target_IP=$2
>> target_port=$3
>> # there are 24 nvme disks on the testbed
>> disk_list=(nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme6n1
>> nvme7n1 nvme8n1 nvme9n1 nvme10n1 nvme11n1 nvme12n1 nvme13n1 nvme14n1
>> nvme15n1 nvme16n1 nvme17n1 nvme18n1 nvme19n1 nvme20n1 nvme21n1
>> nvme22n1
>> nvme23n1)
>> # create target with multiple disks
>> create_target_multi_disks() {
>>          local nqn_name=$1
>>          local svr_ip=$2
>>          local svr_port=$3
>>          local i
>>          local blk_dev
>>          local blk_dev_idx=0
>>          local port_idx=25
>>          echo "create target: $nqn_name $svr_ip $svr_port"
>>          mkdir /sys/kernel/config/nvmet/subsystems/${nqn_name}
>>          echo 1
>> > /sys/kernel/config/nvmet/subsystems/${nqn_name}/attr_allow_any_host
>>          for((i=0;i<${#disk_list[@]};i++)); do
>>                  blk_dev_idx=$((${blk_dev_idx}+1))
>>                  blk_dev=/dev/${disk_list[$i]}
>>                  mkdir
>> /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}
>>                  echo  ${blk_dev} >
>> /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}/device_path
>>                  echo 1 >
>> /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}/enable
>>          done
>>          mkdir /sys/kernel/config/nvmet/ports/${port_idx}
>>          echo ${addr_trtype}
>> > /sys/kernel/config/nvmet/ports/${port_idx}/addr_trtype
>>          echo ipv4
>> > /sys/kernel/config/nvmet/ports/${port_idx}/addr_adrfam
>>          echo ${svr_ip}
>> > /sys/kernel/config/nvmet/ports/${port_idx}/addr_traddr
>>          echo ${svr_port}
>> > /sys/kernel/config/nvmet/ports/${port_idx}/addr_trsvcid
>>          ln -s /sys/kernel/config/nvmet/subsystems/${nqn_name}/
>> /sys/kernel/config/nvmet/ports/${port_idx}/subsystems/${nqn_name}
>> }
>> nvmetcli clear
>> nqn_name="testnqn_25"
>> mkdir /sys/kernel/config/nvmet/hosts/hostnqn
>> create_target_multi_disks ${nqn_name} ${target_IP} ${target_port}
>>
>> And the simulation of high workload program is below:
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <string.h>
>> #include <stdlib.h>
>> #include <pthread.h>
>> #include <sched.h>
>> #define THREAD_NUM  (85)
>> #define MALLOC_SIZE (104857600)
>> void *loopcostcpu(void *args)
>> {
>>          sleep(1);
>>          int *core_id = (int *)args;
>>          cpu_set_t cpuset;
>>          CPU_ZERO(&cpuset);
>>          CPU_SET(*core_id, &cpuset);
>>          sched_setaffinity(0, sizeof(cpuset), &cpuset);
>>          nice(-20);
>>          long *pt = malloc(MALLOC_SIZE*sizeof(long));
>>          if (!pt) {
>>                  printf("error malloc\n");
>>                  return;
>>          }
>>          long i = 0;
>>          while (1) {
>>                  for (i = 0; i < MALLOC_SIZE; i++) {
>>                          pt[i] = i;
>>                  }
>>                  //sleep 10ms
>>                  usleep(10000);
>>          }
>>          return;
>> }
>> int main(int argc, char *argv[])
>> {
>>          pthread_t tid[THREAD_NUM];
>>          int core_id[THREAD_NUM];
>>          int result, i, j = 1;
>>          for (i = 0; i < THREAD_NUM; i++) {
>>                  core_id[i] = j;
>>                  j++;
>>                  result = pthread_create(&tid[i], NULL, loopcostcpu,
>> 									(void*)
>> &core_id[i]);
>>                  if (result) {
>>                          printf("create thread %d failure\n", i);
>>                  }
>>          }
>>          while(1)
>>                  sleep(5);
>>          return 0;
>> }
>>
>> When running above program on target testbed, and we reserved 8
>> cores(88-95) for nvmet target io threads(both rdma and tcp), then we
>> used spdk perf(V20.04) as initiator to create 8 IO queues and per
>> queue has 32 queue depths and 1M randrw io size on another testbed
>> to verify it.
>> TCP's test command shown below:
>> ./spdk_perf_tcp -q 32 -S -P 8 -s 4096 -w randrw -t 300 -c 0xff00000
>> -o
>> 1048576 -M 50 -r 'trtype:TCP adrfam:IPv4 traddr:169.254.2.104
>> trsvcid:4444'
>> RDMA's test command shown below:
>> ./spkd_perf_rdma -q 32 -S -P 8 -s 4096 -w randrw -t 300 -c 0xff00000
>> -o
>> 1048576 -M 50 -r 'trtype:RDMA adrfam:IPv4 traddr:169.254.2.104
>> trsvcid:4444'
>> And we got below test results:
>> TCP's unbound_wq:  IOPS:4585.64, BW:4585.64 MiB/s, Avglat:167515.56us
>> TCP's bound_wq:    IOPS:3588.40, BW:3588.40 MiB/s, Avglat:214088.55us
>> RDMA's unbound_wq: IOPS:6421.47, BW:6421.47 MiB/s, Avglat:119605.17us
>> RDMA's bound_wq:   IOPS:5919.94, BW:5919.94 MiB/s, Avglat:129744.70us
>>
>> It seems using unbound_wq to decreasing competition of CPU between
>> target IO worker thread and other high workload does make sense.
>
> It makes sense for the use case, I agree. What I was asking is to test
> outside this use-case, where nvmet is used as a JBOF, and not
> competing
> with other intensive workloads. Does unbound workqueues damage
> performance?
> Back in 2016 it absolutely did.
>
> What I would also want to see is a test that addresses latency
> sensitive 
> workloads, such
> that the load is not high with large block size, but rather small
> block 
> size, with medium/low
> load and see what is the latency for the two options.

We had done two group tests for unbound_wq and bound_wq; per group had 
6 round tests which included TCP 1M IO size without other workload, 
TCP 4K IO size without other workload, TCP 4K IO size with medium 
workload(about 45% CPU cost and 25% memory cost), RDMA 1M IO size 
without other workload, RDMA 4K IO size without other workload, 
RDMA 4K IO size with medium workload. And every round test we used
8 IO queues, per queue had 32 queue depths and no CPU affinity with 
randrw disk to run 1 hour test and we got below results.

TCP 1M bound_wq:   IOPS:8120.38, BW:8120.38 MiB/s, Avglat:94577.77us
TCP 1M unbound_wq: IOPS:8236.16, BW:8236.16 MiB/s, Avglat:93248.18us

TCP 4K bound_wq:   IOPS:1326767.00, BW:5182.68 MiB/s, Avglat:578.83us
TCP 4K unbound_wq: IOPS:952239.52,  BW:3719.69 MiB/s, Avglat:806.49us

TCP 4K with medium workload bound_wq:
                   IOPS:944414.21, BW:3689.12 MiB/s, Avglat:813.18us
TCP 4K with medium workload unbound_wq:    
                   IOPS:855103.18, BW:3340.25 MiB/s, Avglat:898.11us

RDMA 1M bound_wq:  IOPS:10111.35, BW:10111.35 MiB/s, Avglat:75954.55us
RDMA 1M unbound_wq:IOPS:10108.84, BW:10108.84 MiB/s, Avglat:75973.39us

RDMA 4K bound_wq:  IOPS:2645207.01, BW:10332.84 MiB/s, Avglat:290.31us
RDMA 4K unbound_wq:IOPS:2644785.78, BW:10331.19 MiB/s, Avglat:290.35us

RDMA 4K with medium workload bound_wq:
                   IOPS:2595758.58, BW:10139.68 MiB/s, Avglat:295.84us
RDMA 4K with medium workload unbound_wq:    
                   IOPS:2551177.45, BW:9965.54 MiB/s, Avglat:301.01us

It seems in TCP small block size case the unbound_wq has tremendous
performance drop. So I think we should not radically change the default
workqueue as unbounded but keep the previous patch with parameter to
support the narrow case for performance improvement.

Thanks,
Ping