netdev - Re: [PATCH] pktgen: add a new sample script for 40G and above link testing

Message-ID: <20170825111921.061713c8@redhat.com>
Date:   Fri, 25 Aug 2017 11:19:21 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Robert Hoo <robert.hu@...el.com>
Cc:     brouer@...hat.com, robert.hu@...ux.intel.com,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH] pktgen: add a new sample script for 40G and above link
 testing


(please don't use BCC on the netdev list, replies might miss the list in cc)

Comments inlined below:

On Fri, 25 Aug 2017 10:24:30 +0800 Robert Hoo <robert.hu@...el.com> wrote:

> From: Robert Ho <robert.hu@...el.com>
> 
> It's hard to benchmark 40G+ network bandwidth using ordinary
> tools like iperf, netperf. I then tried with pktgen multiqueue sample
> scripts, but still cannot reach line rate.

The pktgen_sample02_multiqueue.sh does not use burst or skb_cloning.
Thus, the performance will suffer.

See the samples that use the burst feature:
  pktgen_sample03_burst_single_flow.sh
  pktgen_sample05_flow_per_thread.sh

With the pktgen "burst" feature, I can easily generate 40G.  Generating
100G is also possible, but often you will hit some HW limit before the
pktgen limit.  I have hit both (1) the PCIe Gen3 x8 limit, and (2) the
memory bandwidth limit.
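
As a rough back-of-the-envelope check on limit (1): PCIe Gen3 signals at
8 GT/s per lane with 128b/130b encoding, so an x8 slot carries roughly
63 Gbit/s before any protocol overhead, which is well short of 100G.  The
sketch below only does that arithmetic; the function name is made up for
illustration and is not part of the pktgen samples, and real throughput
will be lower still once PCIe packet overhead is counted.

```shell
# Hypothetical helper: raw PCIe Gen3 bandwidth in Mbit/s for a given
# lane count, before any protocol (TLP/DLLP) overhead is subtracted.
pcie_gen3_raw_mbit()
{
    local lanes="$1"
    # 8 GT/s per lane = 8000 Mbit/s on the wire, 128b/130b line encoding
    echo $(( 8000 * lanes * 128 / 130 ))
}

pcie_gen3_raw_mbit 8     # x8 slot
pcie_gen3_raw_mbit 16    # x16 slot
```

By this estimate a 100G NIC needs a Gen3 x16 slot (about 126 Gbit/s raw)
to have any chance of reaching line rate.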


> I then derived this NUMA awared irq affinity sample script from
> multi-queue sample one, successfully benchmarked 40G link. I think this can
> also be useful for 100G reference, though I haven't got device to test.

Okay, so your issue was really related to NUMA irq affinity.  I do feel
that IRQ tuning lives outside the realm of the pktgen scripts, but
looking closer at your script, it doesn't look like you change the
IRQ setting, which is good.

You introduce some helper functions that make it possible to extract
NUMA information in the shell script code, really cool.  I would like
to see these functions integrated into the functions.sh file.

 
> This script simply does:
> Detect $DEV's NUMA node belonging.
> Bind each thread (processor from that NUMA node) with each $DEV queue's
> irq affinity, 1:1 mapping.
> How many '-t' threads input determines how many queues will be
> utilized.
> 
> Tested with Intel XL710 NIC with Cisco 3172 switch.
> 
> It would be even slightly better if the irqbalance service is turned
> off outside.

Yes, if you don't turn off (kill) irqbalance, it will move the
IRQs around behind your back...

 
> Referrences:
> https://people.netfilter.org/hawk/presentations/LCA2015/net_stack_challenges_100G_LCA2015.pdf
> http://www.intel.cn/content/dam/www/public/us/en/documents/reference-guides/xl710-x710-performance-tuning-linux-guide.pdf
> 
> Signed-off-by: Robert Hoo <robert.hu@...el.com>
> ---
>  ...tgen_sample06_numa_awared_queue_irq_affinity.sh | 132 +++++++++++++++++++++
>  1 file changed, 132 insertions(+)
>  create mode 100755 samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> 
> diff --git a/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> new file mode 100755
> index 0000000..f0ee25c
> --- /dev/null
> +++ b/samples/pktgen/pktgen_sample06_numa_awared_queue_irq_affinity.sh
> @@ -0,0 +1,132 @@
> +#!/bin/bash
> +#
> +# Multiqueue: Using pktgen threads for sending on multiple CPUs
> +#  * adding devices to kernel threads which are in the same NUMA node
> +#  * bound devices queue's irq affinity to the threads, 1:1 mapping
> +#  * notice the naming scheme for keeping device names unique
> +#  * nameing scheme: dev@...ead_number
> +#  * flow variation via random UDP source port
> +#
> +basedir=`dirname $0`
> +source ${basedir}/functions.sh
> +root_check_run_with_sudo "$@"
> +#
> +# Required param: -i dev in $DEV
> +source ${basedir}/parameters.sh
> +
> +get_iface_node()
> +{
> +	echo `cat /sys/class/net/$1/device/numa_node`

Here you could use the following shell trick to avoid using "cat":

 echo $(</sys/class/net/$1/device/numa_node)

It looks like you don't handle the case of -1, which indicates a non-NUMA
system.  You need to use something like:

get_iface_node()
{
    local node=$(</sys/class/net/$1/device/numa_node)
    if [[ $node == -1 ]]; then
	echo 0
    else
	echo $node
    fi
}


> +}
> +
> +get_iface_irqs()
> +{
> +	local IFACE=$1
> +	local queues="${IFACE}-.*TxRx"
> +
> +	irqs=$(grep "$queues" /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(grep $IFACE /proc/interrupts | cut -f1 -d:)
> +	[ -z "$irqs" ] && irqs=$(for i in `ls -Ux /sys/class/net/$IFACE/device/msi_irqs` ;\
> +		do grep "$i:.*TxRx" /proc/interrupts | grep -v fdir | cut -f 1 -d : ;\
> +	    done)

Nice that you handle all these different methods.  I personally look
in /proc/irq/*/$IFACE*/../smp_affinity_list , like (copy-paste):

echo " --- Align IRQs ---"
# I've named my NICs ixgbe1 + ixgbe2
for F in /proc/irq/*/ixgbe*-TxRx-*/../smp_affinity_list; do
   # Extract irqname e.g. "ixgbe2-TxRx-2"
   irqname=$(basename $(dirname $(dirname $F))) ;
   # Substring pattern removal
   hwq_nr=${irqname#*-*-}
   echo $hwq_nr > $F
   #grep . -H $F;
done
grep -H . /proc/irq/*/ixgbe*/../smp_affinity_list
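
To make the name extraction in that loop concrete, here is the same
parsing applied to a single path, without touching /proc.  The IRQ number
45 is made up for the example.

```shell
# Hypothetical path; IRQ number 45 is invented for this illustration.
F=/proc/irq/45/ixgbe2-TxRx-2/../smp_affinity_list

# dirname works on the string only: two levels up from the file is the
# per-IRQ action directory, whose basename is the IRQ name.
irqname=$(basename "$(dirname "$(dirname "$F")")")

# Remove the shortest prefix matching "*-*-", leaving the queue number.
hwq_nr=${irqname#*-*-}

echo "$irqname -> queue $hwq_nr"
```

Note that "*-*-" assumes the interface name itself contains no hyphen;
${irqname##*-} (strip everything up to the last hyphen) would be a more
tolerant variant.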

Maybe I should switch to use:
   /sys/class/net/$IFACE/device/msi_irqs/*
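
A minimal sketch of that variant, assuming each entry under msi_irqs is
named after its IRQ number.  The function name and the optional second
argument (an alternative sysfs root) are additions for illustration and
testing, not something from the patch or functions.sh.

```shell
# Sketch: print the MSI/MSI-X IRQ numbers of an interface, one per line.
# The second argument (sysfs root) exists only so this can be tested
# against a fake directory tree.
list_msi_irqs()
{
    local iface="$1"
    local sysfs="${2:-/sys}"
    local f

    for f in "$sysfs/class/net/$iface/device/msi_irqs"/*; do
        # An unmatched glob stays literal, so skip entries that don't exist
        [ -e "$f" ] || continue
        echo "${f##*/}"
    done
}
```

The list may include vectors that are not TxRx queues, which is
presumably why the patch still filters the result through
/proc/interrupts.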
 

> +	[ -z "$irqs" ] && echo "Error: Could not find interrupts for $IFACE"

In the error case you should let the script die.  There is a helper
function for this called "err" (its first argument is the exit code,
which is useful for detecting why your script failed).
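
A sketch of what that could look like.  The "err" below is a local
stand-in so the snippet runs on its own; in the real script it would come
from functions.sh.  The find_irqs function, its file argument and the
exit code 3 are hypothetical and only there to keep the example small
and testable.

```shell
# Local stand-in for the err helper: first argument is the exit code
err()
{
    local exitcode="$1"
    shift
    echo "ERROR: $*" >&2
    exit "$exitcode"
}

# Hypothetical, simplified lookup: IRQ numbers whose /proc/interrupts
# line mentions the interface.  The file argument is for testing only.
find_irqs()
{
    local iface="$1"
    local file="${2:-/proc/interrupts}"
    local irqs

    irqs=$(grep "$iface" "$file" | cut -f1 -d:)
    [ -z "$irqs" ] && err 3 "Could not find interrupts for $iface"
    # Unquoted on purpose: collapses newlines and padding to single spaces
    echo $irqs
}
```

One caveat: when the function is called inside backticks or $(...), as
the patch does, the exit only terminates that subshell, so the caller
still needs to check for an empty result.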


> +	echo $irqs
> +}

> +get_node_cpus()
> +{
> +	local node=$1
> +	local node_cpu_list
> +	local node_cpu_range_list=`cut -f1- -d, --output-delimiter=" " \
> +			/sys/devices/system/node/node$node/cpulist`
> +
> +	for cpu_range in $node_cpu_range_list
> +	do
> +		node_cpu_list="$node_cpu_list "`seq -s " " ${cpu_range//-/ }`
> +	done
> +
> +	echo $node_cpu_list
> +}
> +
> +
> +# Base Config
> +DELAY="0"        # Zero means max speed
> +COUNT="20000000"   # Zero means indefinitely
> +[ -z "$CLONE_SKB" ] && CLONE_SKB="0"
> +
> +# Flow variation random source port between min and max
> +UDP_MIN=9
> +UDP_MAX=109
> +
> +node=`get_iface_node $DEV`
> +irq_array=(`get_iface_irqs $DEV`)
> +cpu_array=(`get_node_cpus $node`)

Nice trick to generate an array.
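
For reference, a sketch of expanding a cpulist string such as "0-3,8-11"
without seq.  The function name is hypothetical.  Unlike the seq-based
loop in get_node_cpus, it also copes with single-CPU entries such as "5":
"seq 5" would print 1 through 5 rather than just 5.

```shell
# Hypothetical helper: expand a sysfs cpulist ("0-3,8-11") into a
# space-separated list of CPU numbers.
expand_cpulist()
{
    local list="$1"
    local out="" range first last cpu
    local IFS=,

    for range in $list; do
        # For a plain number like "5" neither pattern matches,
        # so first and last are both 5.
        first=${range%-*}
        last=${range#*-}
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            out="$out $cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "${out# }"
}

expand_cpulist "0-3,8-11"
```

In bash, wrapping the call as cpu_array=( $(expand_cpulist "$list") )
gives the same kind of array as in the patch.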

> +
> +[ $THREADS -gt ${#irq_array[*]} -o $THREADS -gt ${#cpu_array[*]}  ] && \
> +	err 1 "Thread number $THREADS exceeds: min (${#irq_array[*]},${#cpu_array[*]})"
> +
> +# (example of setting default params in your script)
> +if [ -z "$DEST_IP" ]; then
> +    [ -z "$IP6" ] && DEST_IP="198.18.0.42" || DEST_IP="FD00::1"
> +fi
> +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
> +
> +# General cleanup everything since last run
> +pg_ctrl "reset"
> +
> +# Threads are specified with parameter -t value in $THREADS
> +for ((i = 0; i < $THREADS; i++)); do
> +    # The device name is extended with @name, using thread number to
> +    # make then unique, but any name will do.
> +    # Set the queue's irq affinity to this $thread (processor)
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@...hread}
> +    echo $thread > /proc/irq/${irq_array[$i]}/smp_affinity_list
> +    echo "irq ${irq_array[$i]} is set affinity to `cat /proc/irq/${irq_array[$i]}/smp_affinity_list`"
> +
> +    # Add remove all other devices and add_device $dev to thread
> +    pg_thread $thread "rem_device_all"
> +    pg_thread $thread "add_device" $dev
> +
> +    # select queue and bind the queue and $dev in 1:1 relationship
> +    queue_num=$i
> +    echo "queue number is $queue_num"
> +    pg_set $dev "queue_map_min $queue_num"
> +    pg_set $dev "queue_map_max $queue_num"
> +
> +    # Notice config queue to map to cpu (mirrors smp_processor_id())
> +    # It is beneficial to map IRQ /proc/irq/*/smp_affinity 1:1 to CPU number
> +    pg_set $dev "flag QUEUE_MAP_CPU"
> +
> +    # Base config of dev
> +    pg_set $dev "count $COUNT"
> +    pg_set $dev "clone_skb $CLONE_SKB"
> +    pg_set $dev "pkt_size $PKT_SIZE"
> +    pg_set $dev "delay $DELAY"
> +
> +    # Flag example disabling timestamping
> +    pg_set $dev "flag NO_TIMESTAMP"
> +
> +    # Destination
> +    pg_set $dev "dst_mac $DST_MAC"
> +    pg_set $dev "dst$IP6 $DEST_IP"
> +
> +    # Setup random UDP port src range
> +    pg_set $dev "flag UDPSRC_RND"
> +    pg_set $dev "udp_src_min $UDP_MIN"
> +    pg_set $dev "udp_src_max $UDP_MAX"
> +done
> +
> +# start_run
> +echo "Running... ctrl^C to stop" >&2
> +pg_ctrl "start"
> +echo "Done" >&2
> +
> +# Print results
> +for ((i = 0; i < $THREADS; i++)); do
> +    thread=${cpu_array[$i]}
> +    dev=${DEV}@...hread}
> +    echo "Device: $dev"
> +    cat /proc/net/pktgen/$dev | grep -A2 "Result:"
> +done



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer