[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180829201211.6c58b827@cakuba.netronome.com>
Date: Wed, 29 Aug 2018 20:12:11 +0200
From: Jakub Kicinski <jakub.kicinski@...ronome.com>
To: "Eelco Chaudron" <echaudro@...hat.com>
Cc: "David Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
jhs@...atatu.com, xiyou.wangcong@...il.com, jiri@...nulli.us,
simon.horman@...ronome.com,
"Marcelo Ricardo Leitner" <mleitner@...hat.com>,
louis.peens@...ronome.com
Subject: Re: [PATCH 0/2] net/sched: Add hardware specific counters to TC
actions
On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
> On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
>
> > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
> >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
> >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
> >>>> On 11 Aug 2018, at 21:06, David Miller wrote:
> >>>>
> >>>>> From: Jakub Kicinski <jakub.kicinski@...ronome.com>
> >>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
> >>>>>
> >>>>>> It is not immediately clear why this is needed. The memory and
> >>>>>> updating two sets of counters won't come for free, so perhaps a
> >>>>>> stronger justification than troubleshooting is due? :S
> >>>>>>
> >>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd
> >>>>>> know
> >>>>>> that traffic hits the SW datapath, plus the rules which are in_hw
> >>>>>> will
> >>>>>> most likely not match as of today for flower (assuming
> >>>>>> correctness).
> >>>>
> >>>> I strongly believe that these counters are a requirement for a
> >>>> mixed
> >>>> software/hardware (flow) based forwarding environment. The global
> >>>> counters will not help much here as you might have chosen to have
> >>>> certain traffic forwarded by software.
> >>>>
> >>>> These counters are probably the only option you have to figure out
> >>>> why
> >>>> forwarding is not as fast as expected, and you want to blame the TC
> >>>> offload NIC.
> >>>
> >>> The suggested debugging flow would be:
> >>> (1) check the global counter for fallback are incrementing;
> >>> (2) find a flow with high stats but no in_hw flag set.
> >>>
> >>> The in_hw indication should be sufficient in most cases (unless
> >>> there
> >>> are shared blocks between netdevs of different ASICs...).
> >>
> >> I guess the aim is to find miss behaving hardware, i.e. having the
> >> in_hw
> >> flag set, but flows still coming to the kernel.
> >
> > For misbehaving hardware in_hw will not work indeed. Whether we need
> > these extra always-on stats for such use case could be debated :)
> >
> >>>>>> I'm slightly concerned about potential performance impact, would
> >>>>>> you
> >>>>>> be able to share some numbers for non-trivial number of flows
> >>>>>> (100k
> >>>>>> active?)?
> >>>>>
> >>>>> Agreed, features used for diagnostics cannot have a harmful
> >>>>> penalty
> >>>>> for fast path performance.
> >>>>
> >>>> Fast path performance is not affected as these counters are not
> >>>> incremented there. They are only incremented by the nic driver when
> >>>> they
> >>>> gather their statistics from hardware.
> >>>
> >>> Not by much, you are adding state to performance-critical
> >>> structures,
> >>> though, for what is effectively debugging purposes.
> >>>
> >>> I was mostly talking about the HW offload stat updates (sorry for
> >>> not
> >>> being clear).
> >>>
> >>> We can have some hundreds of thousands active offloaded flows, each
> >>> of
> >>> them can have multiple actions, and stats have to be updated
> >>> multiple
> >>> times per second and dumped probably around once a second, too. On
> >>> a
> >>> busy system the stats will get evicted from cache between each
> >>> round.
> >>>
> >>> But I'm speculating let's see if I can get some numbers on it (if
> >>> you
> >>> could get some too, that would be great!).
> >>
> >> I’ll try to measure some of this later this week/early next week.
> >
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded. Update
> > function does not show up in traces at all. It seems under stress
> > (generated with stress-ng) the thread dumping the stats in userspace
> > (in OvS it would be the revalidator) actually consumes less CPU in
> > __gnet_stats_copy_basic (0.4% less for ~2.0% total).
> >
> > Would this match with your results? I'm not sure why dumping would be
> > faster with your change..
>
> Tested with OVS and https://github.com/chaudron/ovs_perf using 300K TC
> rules installed in HW.
>
> For __gnet_stats_copy_basic() being faster I have (had) a theory. Now
> this function is called twice, and I assumed the first call would cache
> memory and the second call would be faster.
>
> Sampling a lot of perf data, I get an average of 1115ns with the base
> kernel and 954ns with the fix applied, so about ~14%.
>
> Thought I would perf tcf_action_copy_stats() as it is the place updating
> the additional counter. But even in this case, I see a better
> performance with the patch applied.
>
> In average 13581ns with the fix, vs base kernel at 1391ns, so about
> 2.3%.
>
> I guess the changes to the tc_action structure got better cache
> alignment.
Interesting you could reproduce the speed up too! +1 for the guess.
Seems like my caution about slowing down SW paths to support HW offload
landed on a very unfortunate patch :)
Powered by blists - more mailing lists