Message-ID: <ae3047fc-bdeb-4b63-91ac-f4cc39434415@grimberg.me>
Date: Wed, 31 Jul 2024 10:36:11 +0300
From: Sagi Grimberg <sagi@...mberg.me>
To: Ping Gan <jacky_gam_2001@....com>, hare@...e.de, hch@....de
Cc: ping.gan@...l.com, kch@...dia.com, linux-nvme@...ts.infradead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 0/2] nvmet: use unbound_wq for RDMA and TCP by default
On 31/07/2024 10:03, Ping Gan wrote:
>> On 26/07/2024 5:34, Ping Gan wrote:
>>>> On 19/07/2024 12:19, Ping Gan wrote:
>>>>> When running nvmf on an SMP platform, the current NVMe target's
>>>>> RDMA and TCP transports use a bound workqueue to handle IO, but
>>>>> when there is other heavy workload on the system (eg: kubernetes),
>>>>> the competition between the bound kworkers and that workload is
>>>>> fierce. To reduce this resource contention, this patchset switches
>>>>> nvmet-rdma and nvmet-tcp to an unbound workqueue; besides that, it
>>>>> also gains some performance improvement. This patchset is based on
>>>>> the previous discussion in the session below.
>>>>>
>>>>> https://lore.kernel.org/lkml/20240719084953.8050-1-jacky_gam_2001@163.com/
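
For anyone following along: the mechanical part of the change is just the
flags passed to alloc_workqueue() when the transport allocates its IO
workqueue. A minimal sketch, not the actual patch (the names follow
nvmet-tcp; the real init function does more than this):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_tcp_wq;

static int __init nvmet_tcp_init(void)
{
        /*
         * Bound (current default): worker pools are per-CPU, so work
         * items execute on the submitting CPU and contend with
         * whatever else runs there:
         *
         *     nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
         *                                    WQ_HIGHPRI, 0);
         *
         * Unbound (proposed): workers may run on any allowed CPU, so
         * the scheduler can migrate them away from busy cores.
         */
        nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
                                       WQ_UNBOUND | WQ_HIGHPRI, 0);
        if (!nvmet_tcp_wq)
                return -ENOMEM;
        return 0;
}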
>>>> Hold your horses.
>>>>
>>>> This cannot just be switched without thorough testing and actual
>>>> justification/proof of a benefit beyond the narrow use-case brought
>>>> initially by Ping Gan.
>>>>
>>>> If the ask is to universally use an unbound workqueue, please
>>>> provide detailed benchmarking convincing us that this makes sense.
>>> So you think we should not make a radical change for the narrow
>>> use-case, but rather keep the parameter to enable it from the
>>> previous version of the patch, right?
>> What I'm saying is that if you want to change the default, please
>> provide justification in the form of benchmarks that support the
>> change. These benchmarks should include throughput, IOPS and latency
>> measurements, and run without the cpu-set constraints you presented
>> originally.
> We tested it on our testbed, which has 4 NUMA nodes, 96 cores, 190GB
> memory and 24 NVMe disks; the unbound workqueue shows a pretty solid
> improvement there. The script that creates the target is below:
>
> #!/bin/bash
> if [ "$#" -ne 3 ] ; then
>     echo "$0 addr_trtype(tcp/rdma) target_IP target_port"
>     exit -1
> fi
> addr_trtype=$1
> target_IP=$2
> target_port=$3
> # there are 24 nvme disks on the testbed
> disk_list=(nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme6n1
>            nvme7n1 nvme8n1 nvme9n1 nvme10n1 nvme11n1 nvme12n1
>            nvme13n1 nvme14n1 nvme15n1 nvme16n1 nvme17n1 nvme18n1
>            nvme19n1 nvme20n1 nvme21n1 nvme22n1 nvme23n1)
> # create a target backed by multiple disks
> create_target_multi_disks() {
>     local nqn_name=$1
>     local svr_ip=$2
>     local svr_port=$3
>     local i
>     local blk_dev
>     local blk_dev_idx=0
>     local port_idx=25
>     echo "create target: $nqn_name $svr_ip $svr_port"
>     mkdir /sys/kernel/config/nvmet/subsystems/${nqn_name}
>     echo 1 > /sys/kernel/config/nvmet/subsystems/${nqn_name}/attr_allow_any_host
>     for ((i = 0; i < ${#disk_list[@]}; i++)); do
>         blk_dev_idx=$((blk_dev_idx + 1))
>         blk_dev=/dev/${disk_list[$i]}
>         mkdir /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}
>         echo ${blk_dev} > /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}/device_path
>         echo 1 > /sys/kernel/config/nvmet/subsystems/${nqn_name}/namespaces/${blk_dev_idx}/enable
>     done
>     mkdir /sys/kernel/config/nvmet/ports/${port_idx}
>     echo ${addr_trtype} > /sys/kernel/config/nvmet/ports/${port_idx}/addr_trtype
>     echo ipv4 > /sys/kernel/config/nvmet/ports/${port_idx}/addr_adrfam
>     echo ${svr_ip} > /sys/kernel/config/nvmet/ports/${port_idx}/addr_traddr
>     echo ${svr_port} > /sys/kernel/config/nvmet/ports/${port_idx}/addr_trsvcid
>     ln -s /sys/kernel/config/nvmet/subsystems/${nqn_name}/ \
>           /sys/kernel/config/nvmet/ports/${port_idx}/subsystems/${nqn_name}
> }
> nvmetcli clear
> nqn_name="testnqn_25"
> mkdir /sys/kernel/config/nvmet/hosts/hostnqn
> create_target_multi_disks ${nqn_name} ${target_IP} ${target_port}
>
> And the program that simulates a high CPU workload is below:
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <stdlib.h>
> #include <pthread.h>
> #include <sched.h>
>
> #define THREAD_NUM  (85)
> #define MALLOC_SIZE (104857600)
>
> /* Pin the thread to its core, raise its priority, then keep the
>  * CPU and memory busy forever. */
> void *loopcostcpu(void *args)
> {
>     sleep(1);
>     int *core_id = (int *)args;
>     cpu_set_t cpuset;
>     CPU_ZERO(&cpuset);
>     CPU_SET(*core_id, &cpuset);
>     sched_setaffinity(0, sizeof(cpuset), &cpuset);
>     nice(-20);
>     long *pt = malloc(MALLOC_SIZE * sizeof(long));
>     if (!pt) {
>         printf("error malloc\n");
>         return NULL;
>     }
>     long i = 0;
>     while (1) {
>         for (i = 0; i < MALLOC_SIZE; i++)
>             pt[i] = i;
>         /* sleep 10ms */
>         usleep(10000);
>     }
>     return NULL;
> }
>
> int main(int argc, char *argv[])
> {
>     pthread_t tid[THREAD_NUM];
>     int core_id[THREAD_NUM];
>     int result, i, j = 1;
>
>     /* one busy thread per core, starting from core 1 */
>     for (i = 0; i < THREAD_NUM; i++) {
>         core_id[i] = j;
>         j++;
>         result = pthread_create(&tid[i], NULL, loopcostcpu,
>                                 (void *)&core_id[i]);
>         if (result)
>             printf("create thread %d failure\n", i);
>     }
>     while (1)
>         sleep(5);
>     return 0;
> }
>
> While running the above program on the target testbed, we reserved 8
> cores (88-95) for the nvmet target IO threads (both RDMA and TCP),
> then used spdk perf (v20.04) as the initiator on another testbed to
> create 8 IO queues, each with a queue depth of 32, issuing 1MiB
> random read/write IOs.
> The TCP test command is shown below:
> ./spdk_perf_tcp -q 32 -S -P 8 -s 4096 -w randrw -t 300 -c 0xff00000 \
>     -o 1048576 -M 50 -r 'trtype:TCP adrfam:IPv4 traddr:169.254.2.104 trsvcid:4444'
> The RDMA test command is shown below:
> ./spdk_perf_rdma -q 32 -S -P 8 -s 4096 -w randrw -t 300 -c 0xff00000 \
>     -o 1048576 -M 50 -r 'trtype:RDMA adrfam:IPv4 traddr:169.254.2.104 trsvcid:4444'
> And we got the following test results (with 1MiB IOs, the IOPS and
> MiB/s figures coincide):
> TCP,  unbound_wq: IOPS: 4585.64, BW: 4585.64 MiB/s, Avg lat: 167515.56 us
> TCP,  bound_wq:   IOPS: 3588.40, BW: 3588.40 MiB/s, Avg lat: 214088.55 us
> RDMA, unbound_wq: IOPS: 6421.47, BW: 6421.47 MiB/s, Avg lat: 119605.17 us
> RDMA, bound_wq:   IOPS: 5919.94, BW: 5919.94 MiB/s, Avg lat: 129744.70 us
>
> So using an unbound workqueue to reduce the CPU competition between
> the target IO worker threads and other heavy workloads does seem to
> make sense.
It makes sense for that use case, I agree. What I was asking for is a
test outside this use-case, where nvmet is used as a JBOF and is not
competing with other intensive workloads. Do unbound workqueues damage
performance there? Back in 2016 they absolutely did.
What I would also want to see is a test that addresses latency-sensitive
workloads: not a high load with a large block size, but a small block
size under medium/low load, comparing the latency of the two options.
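
To make that concrete, something along these lines run from the initiator
would do for a first pass: a minimal QD=1 read-latency probe (build with
gcc -O2; the device path and offset range are placeholders for a namespace
connected via nvme-cli). A proper comparison should of course use fio or
spdk perf with small blocks and low queue depth against both workqueue
variants:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define IO_SIZE  4096        /* small block, as in the latency ask */
#define N_IOS    10000       /* IOs issued one at a time (QD=1)    */
#define RANGE_MB 1024        /* random offsets in the first 1GiB   */

int main(int argc, char *argv[])
{
        /* device path is a placeholder for the connected namespace */
        const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";
        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        void *buf;
        if (posix_memalign(&buf, 4096, IO_SIZE)) {
                perror("posix_memalign");
                return 1;
        }
        long long nblocks = (long long)RANGE_MB * 1024 * 1024 / IO_SIZE;
        struct timespec t0, t1;
        double total_us = 0.0;
        for (int i = 0; i < N_IOS; i++) {
                off_t off = (off_t)(rand() % nblocks) * IO_SIZE;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pread(fd, buf, IO_SIZE, off) != IO_SIZE) {
                        perror("pread");
                        return 1;
                }
                clock_gettime(CLOCK_MONOTONIC, &t1);
                total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e3;
        }
        printf("avg QD=1 %d-byte read latency: %.2f us\n",
               IO_SIZE, total_us / N_IOS);
        free(buf);
        close(fd);
        return 0;
}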