Message-ID: <E4A2D662-0C89-4B61-BB68-ED985D9EC4B3@redhat.com>
Date: Wed, 29 Aug 2018 11:43:47 +0200
From: "Eelco Chaudron" <echaudro@...hat.com>
To: "Jakub Kicinski" <jakub.kicinski@...ronome.com>
Cc: "David Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
jhs@...atatu.com, xiyou.wangcong@...il.com, jiri@...nulli.us,
simon.horman@...ronome.com,
"Marcelo Ricardo Leitner" <mleitner@...hat.com>,
louis.peens@...ronome.com
Subject: Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC
actions
On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
> On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
>> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
>>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
>>>> On 11 Aug 2018, at 21:06, David Miller wrote:
>>>>
>>>>> From: Jakub Kicinski <jakub.kicinski@...ronome.com>
>>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
>>>>>
>>>>>> It is not immediately clear why this is needed. The memory and
>>>>>> updating two sets of counters won't come for free, so perhaps a
>>>>>> stronger justification than troubleshooting is due? :S
>>>>>>
>>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd know
>>>>>> that traffic hits the SW datapath, plus the rules which are in_hw will
>>>>>> most likely not match as of today for flower (assuming correctness).
>>>>
>>>> I strongly believe that these counters are a requirement for a mixed
>>>> software/hardware (flow) based forwarding environment. The global
>>>> counters will not help much here as you might have chosen to have
>>>> certain traffic forwarded by software.
>>>>
>>>> These counters are probably the only option you have to figure out why
>>>> forwarding is not as fast as expected, and you want to blame the TC
>>>> offload NIC.
>>>
>>> The suggested debugging flow would be:
>>> (1) check that the global counters for fallback are incrementing;
>>> (2) find a flow with high stats but no in_hw flag set.
>>>
>>> The in_hw indication should be sufficient in most cases (unless there
>>> are shared blocks between netdevs of different ASICs...).
>>
>> I guess the aim is to find misbehaving hardware, i.e. hardware that has
>> the in_hw flag set but flows are still coming to the kernel.
>
> For misbehaving hardware, in_hw will indeed not work. Whether we need
> these extra always-on stats for such a use case could be debated :)
>
>>>>>> I'm slightly concerned about potential performance impact, would you
>>>>>> be able to share some numbers for a non-trivial number of flows
>>>>>> (100k active?)?
>>>>>
>>>>> Agreed, features used for diagnostics cannot have a harmful penalty
>>>>> for fast path performance.
>>>>
>>>> Fast path performance is not affected, as these counters are not
>>>> incremented there. They are only incremented by the NIC driver when
>>>> it gathers its statistics from hardware.
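To make this concrete, below is a simplified sketch of how a driver would
feed the hardware counters. The foo_* names are made up for illustration;
the point is that tcf_action_stats_update() is called with the new hw flag
set only from the driver's stats path, never from the software datapath
(exact signature may differ slightly from the final patches):

/* Simplified, illustrative driver stats callback: read the flow counters
 * from the NIC and account them as hardware-forwarded (hw = true).  The
 * software fast path never executes this code.
 */
static void foo_flower_update_stats(struct foo_flow_entry *flow,
				    struct tc_action *act)
{
	u64 bytes, lastuse;
	u32 packets;

	/* foo_hw_read_flow_counters() is hypothetical: it pulls the
	 * byte/packet counters for this flow out of the hardware. */
	foo_hw_read_flow_counters(flow, &bytes, &packets, &lastuse);

	tcf_action_stats_update(act, bytes, packets, lastuse, true);
}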
>>>
>>> Not by much, though you are adding state to performance-critical
>>> structures for what is effectively debugging purposes.
>>>
>>> I was mostly talking about the HW offload stat updates (sorry for not
>>> being clear).
>>>
>>> We can have some hundreds of thousands of active offloaded flows, each
>>> of them can have multiple actions, and stats have to be updated multiple
>>> times per second and dumped probably around once a second, too. On a
>>> busy system the stats will get evicted from cache between each round.
>>>
>>> But I'm speculating; let's see if I can get some numbers on it (if you
>>> could get some too, that would be great!).
>>
>> I’ll try to measure some of this later this week/early next week.
>
> I asked Louis to run some tests while I'm travelling, and he reports
> that my worry about reporting the extra stats was unfounded. The update
> function does not show up in traces at all. It seems that under stress
> (generated with stress-ng) the thread dumping the stats in userspace
> (in OvS it would be the revalidator) actually consumes less CPU in
> __gnet_stats_copy_basic (0.4% less for ~2.0% total).
>
> Would this match your results? I'm not sure why dumping would be
> faster with your change...
Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC
rules installed in HW.
For __gnet_stats_copy_basic() being faster I have (had) a theory: this
function is now called twice, and I assumed the first call would warm the
cache so the second call would be faster.
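For reference, with the change applied the action dump path copies both
counter sets, roughly like this (simplified sketch, not the patch
verbatim):

/* Simplified sketch of tcf_action_copy_stats() with the change applied:
 * the basic counters are copied twice, once for the software counters and
 * once for the new hardware counters (exported as a separate netlink
 * attribute, TCA_STATS_BASIC_HW).
 */
if (gnet_stats_copy_basic(NULL, &d, p->cpu_bstats, &p->tcfa_bstats) < 0 ||
    gnet_stats_copy_basic_hw(NULL, &d, p->cpu_bstats_hw,
			     &p->tcfa_bstats_hw) < 0 ||
    gnet_stats_copy_rate_est(&d, &p->tcfa_rate_est) < 0 ||
    gnet_stats_copy_queue(&d, p->cpu_qstats,
			  &p->tcfa_qstats, p->tcfa_qstats.qlen) < 0)
	goto errout;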
Sampling a lot of perf data, I get an average of 1115ns with the base
kernel and 954ns with the fix applied, so roughly 14% faster.
I also thought I would perf tcf_action_copy_stats(), as it is the place
updating the additional counters. But even in this case I see better
performance with the patch applied: on average 13581ns with the fix vs the
base kernel at 1391ns, so about 2.3%.
I guess the changes to the tc_action structure resulted in better cache
alignment.
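For reference, the counter-related fields in struct tc_action after the
change look roughly as below (simplified excerpt, unrelated members
omitted); the *_hw members are the new ones:

/* Simplified excerpt of struct tc_action with the change applied; only
 * the statistics related members are shown.
 */
struct tc_action {
	...
	struct gnet_stats_basic_packed		tcfa_bstats;
	struct gnet_stats_basic_packed		tcfa_bstats_hw;	/* new */
	struct gnet_stats_queue			tcfa_qstats;
	struct net_rate_estimator __rcu		*tcfa_rate_est;
	...
	struct gnet_stats_basic_cpu __percpu	*cpu_bstats;
	struct gnet_stats_basic_cpu __percpu	*cpu_bstats_hw;	/* new */
	struct gnet_stats_queue __percpu	*cpu_qstats;
	...
};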
>>>> However, flow creation is affected, as this is where the extra
>>>> memory gets allocated. I had done some 40K flow tests before and did
>>>> not see any noticeable change in flow insertion performance. As
>>>> requested by Jakub I did it again for 100K (and threw a Netronome
>>>> blade in the mix ;). I used Marcelo’s test tool,
>>>> https://github.com/marceloleitner/perf-flower.git.
>>>> Here are the numbers (time in seconds) for 10 runs in sorted order:
>>>>
>>>> +-------------+----------------+
>>>> | Base_kernel | Change_applied |
>>>> +-------------+----------------+
>>>> |    5.684019 |       5.656388 |
>>>> |    5.699658 |       5.674974 |
>>>> |    5.725220 |       5.722107 |
>>>> |    5.739285 |       5.839855 |
>>>> |    5.748088 |       5.865238 |
>>>> |    5.766231 |       5.873913 |
>>>> |    5.842264 |       5.909259 |
>>>> |    5.902202 |       5.912685 |
>>>> |    5.905391 |       5.947138 |
>>>> |    6.032997 |       5.997779 |
>>>> +-------------+----------------+
>>>>
>>>> I guess the deviation is in the userspace part, which is where in real
>>>> life flows get added anyway.
>>>>
>>>> Let me know if anything is still unclear.