netdev - Re: [RFC PATCH 1/5] net: implement support for low latency socket polling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 5 Mar 2013 16:43:01 +0000
From:	Ben Hutchings <bhutchings@...arflare.com>
To:	Eliezer Tamir <eliezer.tamir@...ux.jf.intel.com>
CC:	<linux-kernel@...r.kernel.org>, <netdev@...r.kernel.org>,
	Dave Miller <davem@...emloft.net>,
	Jesse Brandeburg <jesse.brandeburg@...el.com>,
	<e1000-devel@...ts.sourceforge.net>,
	Willem de Bruijn <willemb@...gle.com>,
	Andi Kleen <andi@...stfloor.org>, HPA <hpa@...or.com>,
	Eliezer Tamir <eliezer@...ir.org.il>
Subject: Re: [RFC PATCH 1/5] net: implement support for low latency socket
 polling

On Wed, 2013-02-27 at 09:55 -0800, Eliezer Tamir wrote:
> Adds a new ndo_ll_poll method and the code that supports and uses it.
> This method can be used by low latency applications to busy poll ethernet
> device queues directly from the socket code. The ip_low_latency_poll sysctl
> entry controls how many cycles to poll. Set to zero to disable.
[...]
> --- /dev/null
> +++ b/include/net/ll_poll.h
> @@ -0,0 +1,71 @@
> +/*
> + * low latency device queue flush
> + */
> +
> +#ifndef _LINUX_NET_LL_POLL_H
> +#define _LINUX_NET_LL_POLL_H
> +#ifdef CONFIG_INET_LL_RX_POLL
> +#include <linux/netdevice.h>
> +struct napi_struct;
> +extern int sysctl_net_ll_poll __read_mostly;
> +
> +/* return values from ndo_ll_poll */
> +#define LL_FLUSH_DONE		0
> +#define LL_FLUSH_FAILED		1
> +#define LL_FLUSH_BUSY		2
> +
> +static inline int sk_valid_ll(struct sock *sk)

bool

> +{
> +	return sysctl_net_ll_poll && sk->dev_ref &&
> +		!need_resched() && !signal_pending(current);
> +}
> +
> +/*
> + * TODO: how do we know that we have a working get_cycles?
> + * do we limit this by a configure dependacy?

In general it appears to require a run-time check.  You might need to
augment <asm/timex.h>.

> + * TODO: this is not safe when the device can be removed,
> + * but simple refcounting may prevent removal indefinatly
> + */
> +static inline int sk_poll_ll(struct sock *sk)
> +{
> +	struct napi_struct *napi = sk->dev_ref;
> +	const struct net_device_ops *ops;
> +	unsigned long end_time = sysctl_net_ll_poll + get_cycles();

ACCESS_ONCE(sysctl_net_ll_poll)

> +	if (!napi->dev || !napi->dev->netdev_ops ||
> +	    !napi->dev->netdev_ops->ndo_ll_poll)
> +		return false;
> +
> +	local_bh_disable();
> +
> +	ops = napi->dev->netdev_ops;
> +	while (skb_queue_empty(&sk->sk_receive_queue) &&
> +			!time_after((unsigned long)get_cycles(), end_time))

cycles_t may be narrower than unsigned long, in which case time_after()
will not compare correctly.  I think you need to open-code the
equivalent of time_after() but using cycles_t.

> +		if (ops->ndo_ll_poll(napi) == LL_FLUSH_FAILED)
> +				break; /* premanent failure */
> +
> +	local_bh_enable();
> +
> +	return !skb_queue_empty(&sk->sk_receive_queue);
> +}
> +
> +static inline void skb_mark_ll(struct napi_struct *napi, struct sk_buff *skb)
> +{
> +	skb->dev_ref = napi;
> +}

Slightly odd - I would expect skb to be the first parameter.

[...]
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -402,6 +402,18 @@ config INET_LRO
>  
>  	  If unsure, say Y.
>  
> +config INET_LL_RX_POLL
> +	bool "Low Latency Receive Poll"
> +	default n
> +	---help---
> +	  Support Low Latency Receive Queue Poll.
> +	  (For network card drivers which support this option.)
> +	  When waiting for data in read or poll call directly into the the device driver
> +	  to flush packets which may be pending on the device queues into the stack.
> +
> +
> +	  If unsure, say N.

Of course, all distributions will be expected to enable this.  So I'm
not sure what the point of the compile-time option is.  You might as
well enable it at compile-time but leave the default set to 0.

>  config INET_DIAG
>  	tristate "INET: socket monitoring interface"
>  	default y
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 960fd29..0c060c6 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -25,6 +25,7 @@
>  #include <net/inet_frag.h>
>  #include <net/ping.h>
>  #include <net/tcp_memcontrol.h>
> +#include <net/ll_poll.h>
>  
>  static int zero;
>  static int one = 1;
> @@ -326,6 +327,15 @@ static struct ctl_table ipv4_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +#ifdef CONFIG_INET_LL_RX_POLL
> +	{
> +		.procname	= "ip_low_latency_poll",
> +		.data		= &sysctl_net_ll_poll,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
> +#endif

This would need to be added to Documentation/networking/ip-sysctl.txt.

Should the units really be cycles or, say, microseconds?  I assume that
a sysctl setter can do a conversion to cycles so that there's no need to
multiply every time the value is used.  (If the CPU doesn't have
constant_tsc or equivalent then this conversion doesn't quite work, but
then low-latency tunng usually includes disabling frequency scaling.)

Also, this should be a per-device (or even per-NAPI-context?) setting.

>  	{
>  		.procname	= "tcp_syn_retries",
>  		.data		= &sysctl_tcp_syn_retries,
> diff --git a/net/socket.c b/net/socket.c
> index ee0d029..86da082 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -105,6 +105,12 @@
>  #include <linux/sockios.h>
>  #include <linux/atalk.h>
>  
> +#ifdef CONFIG_INET_LL_RX_POLL
> +#include <net/ll_poll.h>
> +int sysctl_net_ll_poll __read_mostly = 150000;

Nicely tuned for your specific test system, no doubt. :-)

> +EXPORT_SYMBOL_GPL(sysctl_net_ll_poll);
> +#endif
[...]

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html