Message-ID: <AM0PR04MB675470086CB8131D715D402A96610@AM0PR04MB6754.eurprd04.prod.outlook.com>
Date:   Tue, 14 Jul 2020 11:21:45 +0000
From:   Claudiu Manoil <claudiu.manoil@....com>
To:     Jakub Kicinski <kuba@...nel.org>
CC:     "David S . Miller" <davem@...emloft.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: RE: [PATCH net-next 6/6] enetc: Add adaptive interrupt coalescing

>-----Original Message-----
>From: Jakub Kicinski <kuba@...nel.org>
[...]
>Subject: Re: [PATCH net-next 6/6] enetc: Add adaptive interrupt coalescing
>
>On Mon, 13 Jul 2020 15:56:10 +0300 Claudiu Manoil wrote:
>> Use the generic dynamic interrupt moderation (dim)
>> framework to implement adaptive interrupt coalescing
>> in ENETC.  With the per-packet interrupt scheme, a high
>> interrupt rate has been noted for moderate traffic flows,
>> leading to high CPU utilization.  The 'dim' scheme
>> implemented by this patch addresses the issue, improving
>> CPU utilization while using minimal coalescing time
>> thresholds to preserve good latency.
>>
>> Below are some measurement results from before and after
>> this patch (and related dependencies), on a system with
>> 2 ARM Cortex-A72 CPUs @ 1.3 GHz (32 KB L1 data cache),
>> using netperf on a 1 Gbit link (maximum throughput):
>>
>> 1) 1 Rx TCP flow, both Rx and Tx processed by the same NAPI
>> thread on the same CPU:
>> 	CPU utilization		int rate (ints/sec)
>> Before:	50%-60% (over 50%)		92k
>> After:  just under 50%			35k
>> Comment:  Small CPU utilization improvement for a single
>> 	  Rx TCP flow (i.e. netperf -t TCP_MAERTS) on a
>> 	  single CPU.
>>
>> 2) 1 Rx TCP flow, Rx processing on CPU0, Tx on CPU1:
>> 	Total CPU utilization	Total int rate (ints/sec)
>> Before:	60%-70%			85k CPU0 + 42k CPU1
>> After:  15%			3.5k CPU0 + 3.5k CPU1
>> Comment:  Huge improvement in total CPU utilization
>> 	  correlated with a huge decrease in interrupt rate.
>>
>> 3) 4 Rx TCP flows + 4 Tx TCP flows (+ pings to check the latency):
>> 	Total CPU utilization	Total int rate (ints/sec)
>> Before:	~80% (spikes to 90%)		~100k
>> After:   60% (more steady)		 ~10k
>> Comment:  Significant improvement for this load test,
>> 	  while the ping latency was not impacted.
>>
>> Signed-off-by: Claudiu Manoil <claudiu.manoil@....com>
>
>Does it really make sense to implement DIM for TX?
>
>For TX the only thing we care about is that no queue in the system
>underflows. So the calculation is simply timeout = queue len / speed.
>The only problem is which queue in the system is the smallest (TX
>ring, TSQ etc.) but IMHO there's little point in the extra work to
>calculate the thresholds dynamically. On real-life workloads, the
>scheduler overhead that the async work structs introduce causes
>measurable regressions.
>
>That's just to share my experience, up to you to decide if you want
>to keep the TX-side DIM or not :)

Yeah, I'm not happy with Tx DIM either; it seems like too much
overhead for this device.
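For reference, the per-vector pattern the generic dim framework
imposes looks roughly like the sketch below (condensed from
Documentation/networking/net_dim.rst; the enetc_* names and the
vector struct are illustrative, not the actual driver code).  Each
vector carries a struct dim plus an async work item, which is where
the scheduler overhead you mention comes from:

/* Condensed sketch of the generic dim usage pattern, following
 * Documentation/networking/net_dim.rst.  The enetc_* names and the
 * vector struct are illustrative only.
 */
#include <linux/dim.h>
#include <linux/workqueue.h>

struct enetc_int_vector {		/* hypothetical per-vector state */
	struct dim rx_dim;
	u16 rx_events;			/* interrupts since ring setup */
	u64 rx_packets, rx_bytes;	/* running counters */
};

static void enetc_rx_dim_work(struct work_struct *w)
{
	struct dim *dim = container_of(w, struct dim, work);
	struct dim_cq_moder moder =
		net_dim_get_rx_moderation(dim->mode, dim->profile_ix);

	/* program moder.usec / moder.pkts into the device here */

	dim->state = DIM_START_MEASURE;	/* re-arm the algorithm */
}

static void enetc_rx_dim_init(struct enetc_int_vector *v)
{
	INIT_WORK(&v->rx_dim.work, enetc_rx_dim_work);
	v->rx_dim.mode = DIM_CQ_PERIOD_MODE_START_FROM_EQE;
}

/* called at the end of each NAPI poll run */
static void enetc_rx_net_dim(struct enetc_int_vector *v)
{
	struct dim_sample s;

	dim_update_sample(v->rx_events, v->rx_packets, v->rx_bytes, &s);
	net_dim(&v->rx_dim, s);		/* may queue enetc_rx_dim_work() */
}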
But there seemed to be no other option left: leaving coalescing
disabled for Tx is not an option, as there are too many Tx interrupts,
and coming up with a single Tx coalescing time threshold to cover all
possible cases is not feasible either.  However, your suggestion to
compute the Tx coalescing values based on link speed, at least that's
how I read it, is worth investigating.  This device is supposed to
handle link speeds ranging from 10 Mbit to 2.5 Gbit, so it would be
great if Tx DIM could be replaced by a set of precomputed values based
on link speed, along the lines of the sketch below.  I'm going to look
into this.  If you have any other suggestions, please let me know.
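To put rough numbers on your queue-len/speed rule: assuming a
256-descriptor Tx ring of 1518-byte frames (made-up figures, not
measurements), a full ring drains in about 3 ms at 1 Gbit/s, so a
per-speed threshold a few times below the drain time should keep the
ring from underflowing.  Something like:

/* Hypothetical helper: fixed Tx coalescing time per link speed,
 * following the rule timeout <= queue len / speed.  Ring size and
 * safety margin below are illustrative, not measured values.
 */
#include <linux/kernel.h>

#define ENETC_TX_RING_BYTES	(256 * 1518)	/* assumed full ring */

static u32 enetc_tx_coal_usecs(u32 speed_mbps)
{
	/* drain time of a full ring: bytes * 8 bits / Mbit/s == usecs */
	u32 drain_us = ENETC_TX_RING_BYTES * 8 / speed_mbps;

	/* stay well below the drain time so the ring never runs dry
	 * while waiting for the coalesced interrupt (caller ensures
	 * link is up, i.e. speed_mbps > 0)
	 */
	return drain_us / 4;
}

That gives ~310 us at 2.5 Gbit but ~78 ms at 10 Mbit, so the low-speed
end would probably need an extra cap.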
Thanks.
Claudiu
