netdev - Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190710112011.GA552@splinter>
Date:   Wed, 10 Jul 2019 14:20:11 +0300
From:   Ido Schimmel <idosch@...sch.org>
To:     Jakub Kicinski <jakub.kicinski@...ronome.com>
Cc:     David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
        jiri@...lanox.com, mlxsw@...lanox.com, dsahern@...il.com,
        roopa@...ulusnetworks.com, nikolay@...ulusnetworks.com,
        andy@...yhouse.net, pablo@...filter.org,
        pieter.jansenvanvuuren@...ronome.com, andrew@...n.ch,
        f.fainelli@...il.com, vivien.didelot@...il.com,
        idosch@...lanox.com,
        Alexei Starovoitov <alexei.starovoitov@...il.com>
Subject: Re: [PATCH net-next 00/11] Add drop monitor for offloaded data paths

On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:  
> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:  
> > > > > From: Ido Schimmel <idosch@...sch.org>
> > > > > Date: Sun,  7 Jul 2019 10:58:17 +0300
> > > > >     
> > > > > > Users have several ways to debug the kernel and understand why a packet
> > > > > > was dropped. For example, using "drop monitor" and "perf". Both
> > > > > > utilities trace kfree_skb(), which is the function called when a packet
> > > > > > is freed as part of a failure. The information provided by these tools
> > > > > > is invaluable when trying to understand the cause of a packet loss.
> > > > > > 
> > > > > > In recent years, large portions of the kernel data path were offloaded
> > > > > > to capable devices. Today, it is possible to perform L2 and L3
> > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN).
> > > > > > Different TC classifiers and actions are also offloaded to capable
> > > > > > devices, at both ingress and egress.
> > > > > > 
> > > > > > However, when the data path is offloaded it is not possible to achieve
> > > > > > the same level of introspection as tools such "perf" and "drop monitor"
> > > > > > become irrelevant.
> > > > > > 
> > > > > > This patchset aims to solve this by allowing users to monitor packets
> > > > > > that the underlying device decided to drop along with relevant metadata
> > > > > > such as the drop reason and ingress port.    
> > > > > 
> > > > > We are now going to have 5 or so ways to capture packets passing through
> > > > > the system, this is nonsense.
> > > > > 
> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this
> > > > > devlink thing.
> > > > > 
> > > > > This is insanity, too many ways to do the same thing and therefore the
> > > > > worst possible user experience.
> > > > > 
> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events,
> > > > > XDP perf events, and these taps there too.
> > > > > 
> > > > > I mean really, think about it from the average user's perspective.  To
> > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not just
> > > > > listen on devlink but configure a special tap thing beforehand and then
> > > > > if someone is using XDP I gotta setup another perf event buffer capture
> > > > > thing too.    
> > > > 
> > > > Let me try to explain again because I probably wasn't clear enough. The
> > > > devlink-trap mechanism is not doing the same thing as other solutions.
> > > > 
> > > > The packets we are capturing in this patchset are packets that the
> > > > kernel (the CPU) never saw up until now - they were silently dropped by
> > > > the underlying device performing the packet forwarding instead of the
> > > > CPU.  
> > 
> > Jakub,
> > 
> > It seems to me that most of the criticism is about consolidation of
> > interfaces because you believe I'm doing something you can already do
> > today, but this is not the case.
> 
> To be clear I'm not opposed to the patches, I'm just trying to
> facilitate a discussion.

Sure, sorry if it came out the wrong way. I appreciate your feedback and
the time you have spent on this subject.

> 
> > Switch ASICs have dedicated traps for specific packets. Usually, these
> > packets are control packets (e.g., ARP, BGP) which are required for the
> > correct functioning of the control plane. You can see this in the SAI
> > interface, which is an abstraction layer over vendors' SDKs:
> > 
> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
> > 
> > We need to be able to configure the hardware policers of these traps and
> > read their statistics to understand how many packets they dropped. We
> > currently do not have a way to do any of that and we rely on hardcoded
> > defaults in the driver which do not fit every use case (from
> > experience):
> > 
> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
> > 
> > We plan to extend devlink-trap mechanism to cover all these use cases. I
> > hope you agree that this functionality belongs in devlink given it is a
> > device-specific configuration and not a netdev-specific one.
> 
> No disagreement on providing knobs for traps.
> 
> > That being said, in its current form, this mechanism is focused on traps
> > that correlate to packets the device decided to drop as this is very
> > useful for debugging.
> 
> That'd be mixing two things - trap configuration and tracing exceptions
> in one API. That's a little suboptimal but not too terrible, especially
> if there is a higher level APIs users can default to.

TBH, initially I was only focused on the drops, but then it occurred to
me that this is a too narrow scope. These traps are only a subset of the
complete list of traps we have and we have similar requirements for both
(statistics, setting policers etc.). Therefore, I decided to design this
interface in a more generic way, so that it could support the different
use cases.

> 
> > Given that the entire configuration is done via devlink and that devlink
> > stores all the information about these traps, it seems logical to also
> > report these packets and their metadata to user space as devlink events.
> > 
> > If this is not desirable, we can try to call into drop_monitor from
> > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will
> > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
> > 
> > IMO, this is less desirable, as instead of having one tool (devlink) to
> > interact with this mechanism we will need two (devlink & dropwatch).
> > 
> > Below I tried to answer all your questions and refer to all the points
> > you brought up.
> > 
> > > When you say silently dropped do you mean that mlxsw as of today
> > > doesn't have any counters exposed for those events?  
> > 
> > Some of these packets are counted, but not all of them.
> > 
> > > If we wanted to consolidate this into something existing we can either
> > >  (a) add similar traps in the kernel data path;
> > >  (b) make these traps extension of statistics.
> > > 
> > > My knee jerk reaction to seeing the patches was that it adds a new
> > > place where device statistics are reported.  
> > 
> > Not at all. This would be a step back. We can already count discards due
> > to VLAN membership on ingress on a per-port basis. A software maintained
> > global counter does not buy us anything.
> > 
> > By also getting the dropped packet - coupled with the drop reason and
> > ingress port - you can understand exactly why and on which VLAN the
> > packet was dropped. I wrote a Wireshark dissector for these netlink
> > packets to make our life easier. You can see the details in my comment
> > to the cover letter:
> > 
> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
> > 
> > In case you do not care about individual packets, but still want more
> > fine-grained statistics for your monitoring application, you can use
> > eBPF. For example, one thing we did is attaching a kprobe to
> > devlink_trap_report() with an eBPF program that dissects the incoming
> > skbs and maintains a counter per-{5 tuple, drop reason}. With
> > ebpf_exporter you can export these statistics to Prometheus on which you
> > can run queries and visualize the results with Grafana. This is
> > especially useful for tail and early drops since it allows you to
> > understand which flows contribute to most of the drops.
> 
> No question that the mechanism is useful.
> 
> > > Users who want to know why things are dropped will not get detailed
> > > breakdown from ethtool -S which for better or worse is the one stop
> > > shop for device stats today.  
> > 
> > I hope I managed to explain why counters are not enough, but I also want
> > to point out that ethtool statistics are not properly documented and
> > this hinders their effectiveness. I did my best to document the exposed
> > traps in order to avoid the same fate:
> > 
> > https://patchwork.ozlabs.org/patch/1128585/
> > 
> > In addition, there are selftests to show how each trap can be triggered
> > to reduce the ambiguity even further:
> > 
> > https://patchwork.ozlabs.org/patch/1128610/
> > 
> > And a note in the documentation to make sure future functionality is
> > tested as well:
> > 
> > https://patchwork.ozlabs.org/patch/1128608/
> > 
> > > Having thought about it some more, however, I think that having a
> > > forwarding "exception" object and hanging statistics off of it is a
> > > better design, even if we need to deal with some duplication to get
> > > there.
> > > 
> > > IOW having an way to "trap all packets which would increment a
> > > statistic" (option (b) above) is probably a bad design.
> > > 
> > > As for (a) I wonder how many of those events have a corresponding event
> > > in the kernel stack?  
> > 
> > Generic packet drops all have a corresponding kfree_skb() calls in the
> > kernel, but that does not mean that every packet dropped by the hardware
> > would also be dropped by the kernel if it were to be injected to its Rx
> > path.
> 
> The notion that all SW events get captured by kfree_skb() would not be
> correct.

I meant that the generic drop reasons I'm exposing with this patchset
all correspond to reasons for which the kernel would drop packets.

> We have the kfree_skb(), and xdp_exception(), and drivers can
> drop packets if various allocations fail.. the situation is already not
> great.
> 
> I think that having a single useful place where users can look to see
> all traffic exception events would go a long way. 

I believe this was Dave's point as well. We have one tool to monitor
kfree_skb() drops and with this patchset we will have another to monitor
HW drops. As I mentioned in my previous reply, I will look into sending
the events via drop_monitor by calling into it from devlink.

I'm not involved with XDP (as you might have noticed), but I assume
drop_monitor could be extended for this use case as well by doing
register_trace_xdp_exception(). Then you could monitor SW, HW and XDP
events using a single netlink channel, potentially split into different
multicast groups to allow user space programs to receive only the events
they care about.

> Software side as I mentioned is pretty brutal, IDK how many users are
> actually willing to decode stack traces to figure out why their system
> is dropping packets :/
> 
> > In my reply to Dave I gave buffer drops as an example.
> 
> The example of buffer drops is also probably the case where having the
> packet is least useful, but yes, I definitely agree devices need a way
> of reporting events that can't happen in SW.
> 
> > There are also situations in which packets can be dropped due to
> > device-specific exceptions and these do not have a corresponding drop
> > reason in the kernel. See example here:
> > 
> > https://patchwork.ozlabs.org/patch/1128587/
> > 
> > > If we could add corresponding trace points and just feed those from
> > > the device driver, that'd obviously be a holy grail.  
> > 
> > Unlike tracepoints, netlink gives you a structured and extensible
> > interface. For example, in Spectrum-1 we cannot provide the Tx port for
> > early/tail drops, whereas for Spectrum-2 and later we can. With netlink,
> > we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for
> > Spectrum-1. You also get a programmatic interface that you can query for
> > this information:
> > 
> > # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> > netdevsim/netdevsim10:
> >   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
> >     metadata:
> >        input_port
> 
> Right, you can set or not set skb fields to some extent but its
> definitely not as flexible as netlink.