Message-ID: <502E2AD9.5060309@freescale.com>
Date: Fri, 17 Aug 2012 14:28:25 +0300
From: Claudiu Manoil <claudiu.manoil@...escale.com>
To: Paul Gortmaker <paul.gortmaker@...driver.com>
CC: Tomas Hruby <thruby@...il.com>,
Eric Dumazet <eric.dumazet@...il.com>,
<netdev@...r.kernel.org>, "David S. Miller" <davem@...emloft.net>
Subject: Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing
On 08/16/2012 06:36 PM, Paul Gortmaker wrote:
> [Re: [RFC net-next 0/4] gianfar: Use separate NAPI for Tx confirmation processing] On 14/08/2012 (Tue 19:08) Claudiu Manoil wrote:
>
>> On 08/14/2012 04:15 AM, Paul Gortmaker wrote:
>>> This is a lot lower variation than what you reported earlier (20
>>> versus 200, I think). It was the variation that raised a red flag
>>> for me...
>> Hi Paul,
>> The earlier variation, which is much bigger (indeed ~200), was
>> observed on a p1020 (slow, 2 cores, MQ_MG_MODE).
>> However, I did not collect measurement results for that board as
>> detailed as those for the p1010 (previous email).
>> The most important performance improvement I've noticed, however,
>> was on the p1020 platform.
>>
>>>> By changing the coalescing settings from default* (rx coalescing off,
>>>> tx-usecs: 10, tx-frames: 16) to:
>>>> ""
>>>> we get a throughput of ~710 Mbps.
>>>>
>>>> For *Image 2)*, using the default tcp_limit_output_bytes value
>>>> (131072) - I've noticed
>>>> that "tweaking" tcp_limit_output_bytes does not improve the
>>>> throughput -, we get the
>>>> following performance numbers:
>>>> * default coalescing settings: ~650 Mbps
>>>> * rx-frames tx-frames 22 rx-usecs 32 tx-usecs 32: ~860-880 Mbps
>>>>
>>>> For *Image 3)*, by disabling BQL (CONFIG_BQL = n), there's *no*
>>>> relevant performance
>>>> improvement compared to Image 1).
>>>> (note:
>>>> For all the measurements, rx and tx BD ring sizes have been set to
>>>> 64, for best performance.)
>>>>
>>>> So, I really tend to believe that the performance degradation comes
>>>> primarily from the driver,
>>>> and the napi poll processing turns out to be an important source for
>>>> that. The proposed patches
>>> This would make sense, if the CPU was slammed at 100% load in dealing
>>> with the tx processing, and the change made the driver considerably more
>>> efficient. But is that really the case? Is the p1010 really going flat
>>> out just to handle the Tx processing? Have you done any sort of
>>> profiling to confirm/deny where the CPU is spending its time?
>> The current gfar_poll implementation first processes the tx
>> confirmation path exhaustively, without a budget/work limit,
>> and only then proceeds with the rx processing within the allotted
>> budget. And this happens for both Rx and Tx confirmation
>> interrupts. I find this unfair and out of balance. Maybe by letting
>> rx processing be triggered by rx interrupts only, and
>> the tx conf path processing be triggered by tx confirmation
>> interrupts only, and, on top of that, by imposing a work limit
>> on the tx confirmation path too, we get a more responsive driver
>> that performs better. Indeed, some profiling data to
>> confirm this would be great, but I don't have it.
>>
>> There's another issue that seems to be solved by this patchset, and
>> I've noticed it only on p1020rdb (this time).
>> That is the occurrence of excessive Rx busy interrupts. Solving this
>> issue may be another factor in the performance
>> improvement on p1020. But maybe this is another discussion.
>>
>>>> show substantial improvement, especially for SMP systems where Tx
>>>> and Rx processing may be
>>>> done in parallel.
>>>> What do you think?
>>>> Is it ok to proceed by re-spinning the patches? Do you recommend
>>>> additional measurements?
>>> Unfortunately Eric is out this week, so we will be without his input for
>>> a while. However, we are only at 3.6-rc1 -- meaning net-next will be
>>> open for quite some time, hence no need to rush to try and jam stuff in.
>>>
>>> Also, I have two targets I'm interested in testing your patches on. The
>>> 1st is a 500MHz mpc8349 board -- which should replicate what you see on
>>> your p1010 (slow, single core). The other is an 8641D, which is
>>> interesting since it will give us the SMP tx/rx as separate threads, but
>>> without the MQ_MG_MODE support (is that a correct assumption?)
>>>
>>> I don't have any fundamental problem with your patches (although 4/4
>>> might be better as two patches) -- the above targets/tests are only
>>> of interest, since I'm not convinced we yet understand _why_ your
>>> changes give a performance boost, and there might be something
>>> interesting hiding in there.
>>>
>>> So, while Eric is out, let me see if I can collect some more data on
>>> those two targets sometime this week.
>> Great, I don't mean to rush. The more data we get on this the better.
>> It would be great if you could do some measurements on your platforms too.
>> 8641D is indeed a dual core with etsec 1.x (so without the MQ_MG_MODE),
>> but I did run some tests on a p2020, which has the same features. However
>> I'm eager to see your results.
> So, I've collected data on 8349 (520MHz single core) and 8641D (1GHz
> dual core) and the results are kind of surprising (to me). The SMP
> target, which in theory should have benefited from the change, actually
> saw about an 8% reduction in throughput. And the slower single core saw
> about a 5% increase.
>
> I also retested the 8641D with just your 1st 3 patches (i.e. drop the
> "Use separate NAPIs for Tx and Rx processing" patch) and it recovered
> about 1/2 the lost throughput, but not all.
>
> I've used your patches exactly as posted, and the same netperf cmdline.
> I briefly experimented with disabling BQL on the 8349 but didn't see any
> impact from doing that (consistent with what you'd reported). I didn't
> see any real large variations either (target and server on same switch),
> but I'm thinking the scatter could be reduced further if I isolated
> the switch entirely to just the target and server. I'll do that if I
> end up doing any more testing on this, since the averages seem to be
> reproducible to about +/- 2% at the moment...
>
> Paul.
>
> --------------
> Command: netperf -l 20 -cC -H 192.168.146.65 -t TCP_STREAM -- -m 1500
> next-next baseline: commit 1f07b62f3205f6ed41759df2892eaf433bc051a1
> fsl RFC: http://patchwork.ozlabs.org/patch/175919/ applied to above.
> Default queue sizes (256), BQL defaults.
>
> 8349 (528 MHz, single core):
> net-next 10 runs
> avg=123
> max=124
> min=121
> send utilization > 99%
>
> fsl RFC 13 runs:
> avg=129 (+ ~5%)
> max=131
> min=127
> send utilization > 99%
>
> 8641D: (1GHz, dual core)
> net-next 10 runs
> avg=826
> max=839
> min=807
> send utilization ~ 70%
>
> fsl RFC 12 runs
> avg=762 (- ~8%)
> max=783
> min=698
> send utilization ~ 70%
>
> fsl RFC, _only_ 1st 3 of 4 patches, 13 runs
> avg=794 (- ~4%)
> max=816
> min=758
> send utilization ~ 70%
> --------------
Hello Paul,
Thanks again for the measurements. It will take me some time to "digest"
the results and to do more tests/analysis on the platforms at my
disposal.
Your results are indeed surprising, but there are some noticeable
differences between our setups too.
First of all, as noted before, I'm using BD rings of size 64 for best
performance (as this has proved to be an optimal setting over time).
So, before starting any tests, I was issuing:
"ethtool -G <ethX> rx 64 tx 64"
Another point is that, to enhance the performance gain, I was using some
"decent" interrupt coalescing settings (at least to have rx coalescing
enabled too, which is off by default). So I've been using:
"ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32"
I think the proposed code enhancement requires some balanced interrupt
coalescing settings too, for best results.
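For reference, this is a sketch of the per-run setup sequence I've been
using (assuming the eTSEC interface under test is eth1; adjust the
interface name and values to your setup):

```shell
# Shrink the Rx/Tx BD rings to 64 descriptors (optimal in my tests):
ethtool -G eth1 rx 64 tx 64

# Enable balanced interrupt coalescing on both paths (rx coalescing
# is off by default):
ethtool -C eth1 rx-frames 22 rx-usecs 32 tx-frames 22 tx-usecs 32

# Read the settings back to verify they took effect:
ethtool -g eth1
ethtool -c eth1
```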
It's interesting that with these settings I was able to reach ~940 Mbps
on a p2020rdb (which is also "single queue, single group", but with two
1.2GHz e500v2 cores), both with and without the RFC patches.
Another point to consider when doing these measurements on SMP systems
is that Rx/Tx interrupt handling should happen on distinct CPUs.
I think this happens by default for netperf on the "non-MQ_MG_MODE"
systems (like 8641D or p2020), but this condition must be verified for
the MQ_MG_MODE systems (like p1020) by checking /proc/interrupts, and
(if needed) forced by setting interrupt affinities accordingly.
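Something like the following should do it (the IRQ numbers below are
hypothetical placeholders; look up the real ones for your board in
/proc/interrupts first):

```shell
# Find the gianfar Rx/Tx interrupt lines and their per-CPU counts:
grep eth1 /proc/interrupts

# Suppose the Tx confirmation IRQ turned out to be 33 and the Rx IRQ 34
# (hypothetical numbers). Pin them to different CPUs via the affinity
# bitmask (bit 0 = CPU0, bit 1 = CPU1, ...):
echo 1 > /proc/irq/33/smp_affinity   # CPU0 handles Tx confirmations
echo 2 > /proc/irq/34/smp_affinity   # CPU1 handles Rx

# Re-run the traffic and re-check /proc/interrupts to confirm the
# counts now increase on the intended CPUs.
```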
Btw., do you happen to have a p1020 board at your disposal too?
Best regards,
Claudiu