netdev - Re: [Patch net-next] net: add a generic tracepoint for TX queue timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <68f5b7e3-4022-edd4-8d18-752b3dfc500f@mellanox.com>
Date:   Wed, 1 May 2019 13:11:11 +0000
From:   Eran Ben Elisha <eranbe@...lanox.com>
To:     Cong Wang <xiyou.wangcong@...il.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:     Jiri Pirko <jiri@...lanox.com>
Subject: Re: [Patch net-next] net: add a generic tracepoint for TX queue
 timeout



On 4/30/2019 9:50 PM, Cong Wang wrote:
> Although devlink health report does a nice job on reporting TX
> timeout and other NIC errors, unfortunately it requires drivers
> to support it but currently only mlx5 has implemented it.

The devlink health was never intended to be the generic mechanism for 
monitoring all driver's TX timeouts notifications. mlx5e driver chose to 
handle TX timeout notification by reporting it to the newly devlink 
health mechanism.

> Before other drivers could catch up, it is useful to have a
> generic tracepoint to monitor this kind of TX timeout. We have
> been suffering TX timeout with different drivers, we plan to
> start to monitor it with rasdaemon which just needs a new tracepoint.

Great idea to suggest a generic trace message that can be monitored over 
all drivers.

> 
> Sample output:
> 
>    ksoftirqd/1-16    [001] ..s2   144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0
> 
> Cc: Eran Ben Elisha <eranbe@...lanox.com>
> Cc: Jiri Pirko <jiri@...lanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
> ---
>   include/trace/events/net.h | 23 +++++++++++++++++++++++
>   net/sched/sch_generic.c    |  2 ++
>   2 files changed, 25 insertions(+)
> 
> diff --git a/include/trace/events/net.h b/include/trace/events/net.h
> index 1efd7d9b25fe..002d6f04b9e5 100644
> --- a/include/trace/events/net.h
> +++ b/include/trace/events/net.h
> @@ -303,6 +303,29 @@ DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_list_exit,
>   	TP_ARGS(ret)
>   );
>   

I would have put this next to net_dev_xmit trace event declaration.

> +TRACE_EVENT(net_dev_xmit_timeout,
> +
> +	TP_PROTO(struct net_device *dev,
> +		 int queue_index),
> +
> +	TP_ARGS(dev, queue_index),
> +
> +	TP_STRUCT__entry(
> +		__string(	name,		dev->name	)
> +		__string(	driver,		netdev_drivername(dev))
> +		__field(	int,		queue_index	)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(name, dev->name);
> +		__assign_str(driver, netdev_drivername(dev));
> +		__entry->queue_index = queue_index;
> +	),
> +
> +	TP_printk("dev=%s driver=%s queue=%d",
> +		__get_str(name), __get_str(driver), __entry->queue_index)
> +);
> +
>   #endif /* _TRACE_NET_H */
>   
>   /* This part must be outside protection */
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 848aab3693bd..cce1e9ee85af 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -32,6 +32,7 @@
>   #include <net/pkt_sched.h>
>   #include <net/dst.h>
>   #include <trace/events/qdisc.h>
> +#include <trace/events/net.h>
>   #include <net/xfrm.h>
>   
>   /* Qdisc to use by default */
> @@ -441,6 +442,7 @@ static void dev_watchdog(struct timer_list *t)
>   			}
>   
>   			if (some_queue_timedout) {
> +				trace_net_dev_xmit_timeout(dev, i);
>   				WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>   				       dev->name, netdev_drivername(dev), i);
>   				dev->netdev_ops->ndo_tx_timeout(dev);
>