netdev - Re: [PATCH] pktgen: add a new sample script for 40G and above link testing

Message-ID: <1504273689.50064.21.camel@linux.intel.com>
Date:   Fri, 01 Sep 2017 21:48:09 +0800
From:   Robert Hoo <robert.hu@...ux.intel.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        davem@...emloft.net, tariqt@...lanox.com, kyle.leet@...il.com
Subject: Re: [PATCH] pktgen: add a new sample script for 40G and above link
 testing

On Fri, 2017-08-25 at 11:19 +0200, Jesper Dangaard Brouer wrote:
> (please don't use BCC on the netdev list, replies might miss the list in cc)
> 
> Comments inlined below:
> 
> On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@...el.com> wrote:
> 
> > From: Robert Ho <robert.hu@...el.com>
> > 
> > It's hard to benchmark 40G+ network bandwidth using ordinary
> > tools like iperf, netperf. I then tried with pktgen multiqueue sample
> > scripts, but still cannot reach line rate.
> 
> The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
> Thus, the performance will suffer.
> 
> See the samples that use the burst feature:
>   pktgen_sample03_burst_single_flow.sh
>   pktgen_sample05_flow_per_thread.sh
> 
> With the pktgen "burst" feature, I can easily generate 40G.  Generating
> 100G is also possible, but often you will hit some HW limits before the
> pktgen limit.  I experienced hitting both (1) PCIe Gen3 x8 limit, and (2)
> memory bandwidth limit.

Thanks for the review, Jesper. Sorry for the late reply; I work on this part time.

I just tried 'pktgen_sample03_burst_single_flow.sh' and 'pktgen_sample05_flow_per_thread.sh'
with these commands:
	./pktgen_sample05_flow_per_thread.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107
	./pktgen_sample03_burst_single_flow.sh -i ens801 -s 1500 -m 3c:fd:fe:9d:6f:f0 -t 2 -v -x -d 192.168.0.107

Indeed, they get close to 40G, though still slightly below my script.
pktgen_sample03 and pktgen_sample05 reach roughly 38xxxMb/sec ~ 39xxxMb/sec;
my script reaches 40xxxMb/sec ~ 41xxxMb/sec (with threads >= 2).

So a general question: is it still worth continuing my sample06_numa_awared_queue_irq_affinity
work, given that sample03 and sample05 already get close to 40G line rate?

> 
> 
> > I then derived this NUMA awared irq affinity sample script from
> > multi-queue sample one, successfully benchmarked 40G link. I think this can
> > also be useful for 100G reference, though I haven't got device to test.
> 
> Okay, so your issue was really related to NUMA irq affinity.  I do feel
> that IRQ tuning lives outside the realm of the pktgen scripts, but
> looking closer at your script, it doesn't look like you change the
> IRQ setting which is good.  

Sorry, I don't quite understand the above. I do change the IRQ affinities;
see "echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list".
Would you prefer that I not change them? I can restore the original values at the end
of the script.
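
Restoring could be sketched roughly as follows. This is only an
illustration, not part of the posted patch: the function names and the
IRQ_BASE override are hypothetical (IRQ_BASE exists only so the functions
can be tried against a scratch directory instead of /proc/irq).

```shell
#!/bin/bash
# Sketch only: remember each IRQ's affinity before changing it, and
# write the saved values back afterwards.
# IRQ_BASE is a hypothetical override; on a real system it is /proc/irq.
IRQ_BASE=${IRQ_BASE:-/proc/irq}
saved_irqs=()    # IRQ numbers, in the order they were saved
saved_lists=()   # original smp_affinity_list content for each IRQ

save_irq_affinity()
{
    local irq
    for irq in "$@"; do
        saved_irqs+=("$irq")
        saved_lists+=("$(<"$IRQ_BASE/$irq/smp_affinity_list")")
    done
}

restore_irq_affinity()
{
    local n
    for ((n = 0; n < ${#saved_irqs[@]}; n++)); do
        echo "${saved_lists[$n]}" > "$IRQ_BASE/${saved_irqs[$n]}/smp_affinity_list"
    done
}
```

In the script this would be called as save_irq_affinity "${irq_array[@]}"
before the main loop, with "trap restore_irq_affinity EXIT" so the values
are also written back when the run is interrupted with Ctrl-C.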
> 
> You introduce some helper functions that make it possible to extract
> NUMA information in the shell script code, really cool.  I would like
> to see these functions being integrated into the functions.sh file.

Yes, that is doable, if you as maintainer think it belongs there.
> 
>  
> > This script simply does:
> > Detect $DEV's NUMA node belonging.
> > Bind each thread (processor from that NUMA node) with each $DEV queue's
> > irq affinity, 1:1 mapping.
> > How many '-t' threads input determines how many queues will be
> > utilized.
> > 
> > Tested with Intel XL710 NIC with Cisco 3172 switch.
> > 
> > It would be even slightly better if the irqbalance service is turned
> > off outside.
> 
> Yes, if you don't turn-off (kill) irqbalance it will move around the
> IRQs behind your back...

Yes, although in my experiments it made very little difference.
> 
>  
> > References:
> > https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> > http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> > 
> > Signed-off-by: Robert Hoo <robert.hu@...el.com>
> > ---
> >  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
> >  1 file changed, 132 insertions(+)
> >  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > 
> > diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > new file mode 100755
> > index 0000000..f0ee25c
> > --- /dev/null
> > +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> > @@ -0,0 +1,132 @@
> > +#!/bin/bash
> > +#
> > +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> > +#  * adding devices to kernel threads which are in the same NUMA node
> > +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> > +#  * notice the naming scheme for keeping device names unique
> > +#  * naming scheme: dev@thread_number
> > +#  * flow variation via random UDP source port
> > +#
> > +basedir=`dirname $0`
> > +source ${basedir}/functions.sh
> > +root_check_run_with_sudo "$@"
> > +#
> > +# Required param: -i dev in $DEV
> > +source ${basedir}/parameters.sh
> > +
> > +get_iface_node()
> > +{
> > +	echo `cat /sys/class/net/$1/device/numa_node`
> 
> Here you could use the following shell trick to avoid using "cat":
> 
>  echo $(</sys/class/net/$1/device/numa_node)

Thanks for the tip. Indeed, this is more concise.
> 
> It looks like you don't handle the case of -1, which indicate non-NUMA
> system.  You need to use something like::
> 
> get_iface_node()
> {
>     local node=$(</sys/class/net/$1/device/numa_node)
>     if [[ $node == -1 ]]; then
> 	echo 0
>     else
> 	echo $node
>     fi
> }

Yes, I will handle the -1 case in v2.
> 
> 
> > +}
> > +
> > +get_iface_irqs()
> > +{
> > +	local IFACE=$1
> > +	local queues="${IFACE}-.*TxRx"
> > +
> > +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> > +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> > +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> > +	    done)
> 
> Nice that you handle all these different methods.  I personally look
> in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):
> 
> echo " --- Align IRQs ---"
> # I've named my NICs ixgbe1 + ixgbe2
> for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
>    # Extract irqname e.g. "ixgbe2-TxRx-2"
>    irqname=$(basename $(dirname $(dirname $F))) ;
>    # Substring pattern removal
>    hwq_nr=${irqname#*-*-}
>    echo $hwq_nr > $F
>    #grep . -H $F;
> done
> grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
> 
> Maybe I should switch to use:
>    /sys/class/net/$IFACE/device/msi_irqs/*
>  
> 
> > +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"
> 
> In the error case you should let the script die.  There is a helper
> function for this called "err" (where first arg is the exitcode, which
> is useful to detect the reason your script failed).

Yes, I noticed that helper function and converted some of my original "echo Error"s to it;
I missed this one during cleanup. I will fix it in v2.
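
A rough sketch of what the v2 error path might look like, trimmed to the
first two lookup methods. The exit code 3 and the INTERRUPTS_FILE override
are hypothetical, and the err stand-in below only assumes the behaviour
described above (first argument is the exit code); the real helper comes
from functions.sh.

```shell
#!/bin/bash
# Stand-in for the err helper, defined only when functions.sh has not
# been sourced. Assumed behaviour: print the message to stderr and exit
# with the given code.
if ! declare -F err > /dev/null; then
    err() { local code=$1; shift; echo "ERROR: $*" >&2; exit "$code"; }
fi

get_iface_irqs()
{
    local iface=$1
    # INTERRUPTS_FILE is a hypothetical override used for testing
    local file=${INTERRUPTS_FILE:-/proc/interrupts}
    local irqs

    irqs=$(grep "${iface}-.*TxRx" "$file" | cut -f1 -d:)
    [ -z "$irqs" ] && irqs=$(grep "$iface" "$file" | cut -f1 -d:)
    # Die instead of only printing a message (3 is an arbitrary code)
    [ -z "$irqs" ] && err 3 "Could not find interrupts for $iface"
    echo $irqs
}
```

One caveat: when the function is called inside a command substitution,
the exit only terminates the subshell, so the caller still has to check
the status, for example: irqs=$(get_iface_irqs $DEV) || exit $?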
> 
> 
> > +	echo $irqs
> > +}
> 
> > +get_node_cpus()
> > +{
> > +	local node=$1
> > +	local node_cpu_list
> > +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> > +			/sys/devices/system/node/node$node/cpulist`
> > +
> > +	for cpu_range in $node_cpu_range_list
> > +	do
> > +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> > +	done
> > +
> > +	echo $node_cpu_list
> > +}
> > +
> > +
> > +# Base Config
> > +DELAY="0"        # Zero means max speed
> > +COUNT="20000000"   # Zero means indefinitely
> > +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> > +
> > +# Flow variation random source port between min and max
> > +UDP_MIN=9
> > +UDP_MAX=109
> > +
> > +node=`get_iface_node $DEV`
> > +irq_array=(`get_iface_irqs $DEV`)
> > +cpu_array=(`get_node_cpus $node`)
> 
> Nice trick to generate an array.
> 
> > +
> > +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> > +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> > +
> > +# (example of setting default params in your script)
> > +if [ -z "$DEST_IP" ]; then
> > +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> > +fi
> > +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> > +
> > +# General cleanup everything since last run
> > +pg_ctrl "reset"
> > +
> > +# Threads are specified with parameter -t value in $THREADS
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    # The device name is extended with @name, using thread number to
> > +    # make them unique, but any name will do.
> > +    # Set the queue's irq affinity to this $thread (processor)
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> > +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> > +
> > +    # Remove all other devices and add_device $dev to thread
> > +    pg_thread $thread "rem_device_all"
> > +    pg_thread $thread "add_device" $dev
> > +
> > +    # select queue and bind the queue and $dev in 1:1 relationship
> > +    queue_num=$i
> > +    echo "queue number is $queue_num"
> > +    pg_set $dev "queue_map_min $queue_num"
> > +    pg_set $dev "queue_map_max $queue_num"
> > +
> > +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> > +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> > +    pg_set $dev "flag QUEUE_MAP_CPU"
> > +
> > +    # Base config of dev
> > +    pg_set $dev "count $COUNT"
> > +    pg_set $dev "clone_skb $CLONE_SKB"
> > +    pg_set $dev "pkt_size $PKT_SIZE"
> > +    pg_set $dev "delay $DELAY"
> > +
> > +    # Flag example disabling timestamping
> > +    pg_set $dev "flag NO_TIMESTAMP"
> > +
> > +    # Destination
> > +    pg_set $dev "dst_mac $DST_MAC"
> > +    pg_set $dev "dst$IP6 $DEST_IP"
> > +
> > +    # Setup random UDP port src range
> > +    pg_set $dev "flag UDPSRC_RND"
> > +    pg_set $dev "udp_src_min $UDP_MIN"
> > +    pg_set $dev "udp_src_max $UDP_MAX"
> > +done
> > +
> > +# start_run
> > +echo "Running... ctrl^C to stop" >&2
> > +pg_ctrl "start"
> > +echo "Done" >&2
> > +
> > +# Print results
> > +for ((i = 0; i < $THREADS; i++)); do
> > +    thread=${cpu_array[$i]}
> > +    dev=${DEV}@${thread}
> > +    echo "Device: $dev"
> > +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> > +done
> 
> 
>