Message-ID: <E4A2D662-0C89-4B61-BB68-ED985D9EC4B3@redhat.com>
Date: Wed, 29 Aug 2018 11:43:47 +0200
From: "Eelco Chaudron" <echaudro@...hat.com>
To: "Jakub Kicinski" <jakub.kicinski@...ronome.com>
Cc: "David Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
jhs@...atatu.com, xiyou.wangcong@...il.com, jiri@...nulli.us,
simon.horman@...ronome.com,
"Marcelo Ricardo Leitner" <mleitner@...hat.com>,
louis.peens@...ronome.com
Subject: Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC
actions
On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
> On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
>> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
>>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
>>>> On 11 Aug 2018, at 21:06, David Miller wrote:
>>>>
>>>>> From: Jakub Kicinski <jakub.kicinski@...ronome.com>
>>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
>>>>>
>>>>>> It is not immediately clear why this is needed. The memory and
>>>>>> updating two sets of counters won't come for free, so perhaps a
>>>>>> stronger justification than troubleshooting is due? :S
>>>>>>
>>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd know
>>>>>> that traffic hits the SW datapath, plus the rules which are in_hw will
>>>>>> most likely not match as of today for flower (assuming correctness).
>>>>
>>>> I strongly believe that these counters are a requirement for a mixed
>>>> software/hardware (flow) based forwarding environment. The global
>>>> counters will not help much here as you might have chosen to have
>>>> certain traffic forwarded by software.
>>>>
>>>> These counters are probably the only option you have to figure out why
>>>> forwarding is not as fast as expected, and you want to blame the TC
>>>> offload NIC.
>>>
>>> The suggested debugging flow would be:
>>> (1) check that the global counters for fallback are incrementing;
>>> (2) find a flow with high stats but no in_hw flag set.
>>>
>>> The in_hw indication should be sufficient in most cases (unless there
>>> are shared blocks between netdevs of different ASICs...).
>>
>> I guess the aim is to find misbehaving hardware, i.e. hardware that has
>> the in_hw flag set but flows are still coming to the kernel.
>
> For misbehaving hardware, in_hw will indeed not work. Whether we need
> these extra always-on stats for such a use case could be debated :)
>
>>>>>> I'm slightly concerned about potential performance impact, would you
>>>>>> be able to share some numbers for a non-trivial number of flows
>>>>>> (100k active?)?
>>>>>
>>>>> Agreed, features used for diagnostics cannot have a harmful penalty
>>>>> for fast path performance.
>>>>
>>>> Fast path performance is not affected, as these counters are not
>>>> incremented there. They are only incremented by the NIC driver when
>>>> it gathers its statistics from hardware.
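To make this concrete, below is a simplified sketch of how a driver would
feed the hardware counters. The foo_* names are made up for illustration;
the point is that tcf_action_stats_update() is called with the new hw flag
set only from the driver's stats path, never from the software datapath
(exact signature may differ slightly from the final patches):

/* Simplified, illustrative driver stats callback: read the flow counters
 * from the NIC and account them as hardware-forwarded (hw = true).  The
 * software fast path never executes this code.
 */
static void foo_flower_update_stats(struct foo_flow_entry *flow,
				    struct tc_action *act)
{
	u64 bytes, lastuse;
	u32 packets;

	/* foo_hw_read_flow_counters() is hypothetical: it pulls the
	 * byte/packet counters for this flow out of the hardware. */
	foo_hw_read_flow_counters(flow, &bytes, &packets, &lastuse);

	tcf_action_stats_update(act, bytes, packets, lastuse, true);
}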
>>>
>>> Not by much, though you are adding state to performance-critical
>>> structures for what is effectively debugging purposes.
>>>
>>> I was mostly talking about the HW offload stat updates (sorry for not
>>> being clear).
>>>
>>> We can have some hundreds of thousands of active offloaded flows, each
>>> of them can have multiple actions, and stats have to be updated multiple
>>> times per second and dumped probably around once a second, too. On a
>>> busy system the stats will get evicted from cache between each round.
>>>
>>> But I'm speculating; let's see if I can get some numbers on it (if you
>>> could get some too, that would be great!).
>>
>> I’ll try to measure some of this later this week/early next week.
>
> I asked Louis to run some tests while I'm travelling, and he reports
> that my worry about reporting the extra stats was unfounded. The update
> function does not show up in traces at all. It seems that under stress
> (generated with stress-ng) the thread dumping the stats in userspace
> (in OvS it would be the revalidator) actually consumes less CPU in
> __gnet_stats_copy_basic (0.4% less for ~2.0% total).
>
> Would this match your results? I'm not sure why dumping would be
> faster with your change...
Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC
rules installed in HW.
For __gnet_stats_copy_basic() being faster I have (had) a theory: this
function is now called twice, and I assumed the first call would warm the
cache so the second call would be faster.
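For reference, with the change applied the action dump path copies both
counter sets, roughly like this (simplified sketch, not the patch
verbatim):

/* Simplified sketch of tcf_action_copy_stats() with the change applied:
 * the basic counters are copied twice, once for the software counters and
 * once for the new hardware counters (exported as a separate netlink
 * attribute, TCA_STATS_BASIC_HW).
 */
if (gnet_stats_copy_basic(NULL, &d, p->cpu_bstats, &p->tcfa_bstats) < 0 ||
    gnet_stats_copy_basic_hw(NULL, &d, p->cpu_bstats_hw,
			     &p->tcfa_bstats_hw) < 0 ||
    gnet_stats_copy_rate_est(&d, &p->tcfa_rate_est) < 0 ||
    gnet_stats_copy_queue(&d, p->cpu_qstats,
			  &p->tcfa_qstats, p->tcfa_qstats.qlen) < 0)
	goto errout;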
Sampling a lot of perf data, I get an average of 1115ns with the base
kernel and 954ns with the fix applied, so roughly 14% faster.
I also thought I would perf tcf_action_copy_stats(), as it is the place
updating the additional counters. But even in this case I see better
performance with the patch applied: on average 13581ns with the fix vs the
base kernel at 1391ns, so about 2.3%.
I guess the changes to the tc_action structure resulted in better cache
alignment.
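For reference, the counter-related fields in struct tc_action after the
change look roughly as below (simplified excerpt, unrelated members
omitted); the *_hw members are the new ones:

/* Simplified excerpt of struct tc_action with the change applied; only
 * the statistics related members are shown.
 */
struct tc_action {
	...
	struct gnet_stats_basic_packed		tcfa_bstats;
	struct gnet_stats_basic_packed		tcfa_bstats_hw;	/* new */
	struct gnet_stats_queue			tcfa_qstats;
	struct net_rate_estimator __rcu		*tcfa_rate_est;
	...
	struct gnet_stats_basic_cpu __percpu	*cpu_bstats;
	struct gnet_stats_basic_cpu __percpu	*cpu_bstats_hw;	/* new */
	struct gnet_stats_queue __percpu	*cpu_qstats;
	...
};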
>>>> However, flow creation is affected, as this is where the extra
>>>> memory gets allocated. I had done some 40K flow tests before and did
>>>> not see any noticeable change in flow insertion performance. As
>>>> requested by Jakub I did it again for 100K (and threw a Netronome
>>>> blade in the mix ;). I used Marcelo’s test tool,
>>>> https://github.com/marceloleitner/perf-flower.git.
>>>> Here are the numbers (time in seconds) for 10 runs in sorted order:
>>>>
>>>> +-------------+----------------+
>>>> | Base_kernel | Change_applied |
>>>> +-------------+----------------+
>>>> |    5.684019 |       5.656388 |
>>>> |    5.699658 |       5.674974 |
>>>> |    5.725220 |       5.722107 |
>>>> |    5.739285 |       5.839855 |
>>>> |    5.748088 |       5.865238 |
>>>> |    5.766231 |       5.873913 |
>>>> |    5.842264 |       5.909259 |
>>>> |    5.902202 |       5.912685 |
>>>> |    5.905391 |       5.947138 |
>>>> |    6.032997 |       5.997779 |
>>>> +-------------+----------------+
>>>>
>>>> I guess the deviation is in the userspace part, which is where in real
>>>> life flows get added anyway.
>>>>
>>>> Let me know if anything is still unclear.