netdev - Re: [PATCH net-next V2] net: dynamic ingress

Message-ID: <20101002093255.GA2049@del.dom.local>
Date:	Sat, 2 Oct 2010 11:32:55 +0200
From:	Jarek Poplawski <jarkao2@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	hadi@...erus.ca, David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>
Subject: Re: [PATCH net-next V2] net: dynamic ingress_queue allocation

On Fri, Oct 01, 2010 at 03:56:28PM +0200, Eric Dumazet wrote:
> On Friday, 1 October 2010 at 07:45 -0400, jamal wrote:
> > On Fri, 2010-10-01 at 00:58 +0200, Eric Dumazet wrote:
> > > Hi Jamal
> > > 
> > > Here is the dynamic allocation I promised. I lightly tested it, could
> > > you review it please ?
> > > Thanks !
> > > 
> > > [PATCH net-next2.6] net: dynamic ingress_queue allocation
> > > 
> > > ingress being not used very much, and net_device->ingress_queue being
> > > quite a big object (128 or 256 bytes), use a dynamic allocation if
> > > needed (tc qdisc add dev eth0 ingress ...)
> > 
> > I agree with the principle that it is valuable to make it dynamic for
> > people who don't want it; but, but (like my kid would say, sniff, sniff)
> > you are making me sad saying it is not used very much ;-> It is used
> > very much in my world. My friend Jarek uses it ;->

Thanks Jamal, my friend, I'm really glad! (And sad ;-)

> 
> ;)
> 
> > 
> > > +#ifdef CONFIG_NET_CLS_ACT
> > 
> > I think appropriately this should be NET_SCH_INGRESS (everywhere else as
> > well).
> > 
> 
> I first thought of this, but found it would add a new dependency on
> vmlinux:
> 
> If someone wants to add the NET_SCH_INGRESS module, he would need to
> recompile the whole kernel and reboot.
> 
> This is probably why ing_filter() and handle_ing() are enclosed in
> CONFIG_NET_CLS_ACT, not CONFIG_NET_SCH_INGRESS.
> 
> Since struct net_device only holds one pointer (after this patch), I
> believe it's better to use the same dependency.
> 
> > 
> > > +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> > > +{
> > > +#ifdef CONFIG_NET_CLS_ACT
> > > +	return dev->ingress_queue;
> > > +#else
> > > +	return NULL;
> > > +#endif
> > 
> > Above, if you just returned dev->ingress_queue, wouldn't it always be
> > NULL if it was not allocated?
> > 
> 
> ingress_queue is not defined in "struct net_device *dev" if 
> !CONFIG_NET_CLS_ACT
> 
> Returning NULL here permits dead code elimination by the compiler.
> 
> Then again, probably nobody unsets CONFIG_NET_CLS_ACT, so we can probably
> avoid this preprocessor stuff.
> 
> > 
> > > @@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
> > >  					 struct packet_type **pt_prev,
> > >  					 int *ret, struct net_device *orig_dev)
> > >  {
> > > -	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
> > > +	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
> > > +
> > > +	if (!rxq || rxq->qdisc == &noop_qdisc)
> > >  		goto out;
> > 
> > I stared at above a little longer since this is the only fast path
> > affected; is it a few more cycles now for people who love ingress?
> 
> I see, this adds an indirection and a conditional branch, but this
> should be in the cpu cache and well predicted.
> 
> I thought about adding a fake "struct netdev_queue" object, with a qdisc
> pointing to noop_qdisc. But this would slow down non-ingress users
> a bit ;)
> 
> For people not using ingress, my solution removes an access to an extra
> cache line. So network latency is improved a bit when cpu caches are
> full of user land data.
> 
> > 
> > > @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> > >  		    (new && new->flags & TCQ_F_INGRESS)) {
> > >  			num_q = 1;
> > >  			ingress = 1;
> > > +			if (!dev_ingress_queue(dev))
> > > +				return -ENOENT;
> > >  		}
> > >  
> > 
> > The above looks clever but worries me because it changes the old flow.
> > If you have time,  the following tests will alleviate my fears
> > 
> > 1) compile support for ingress and add/delete ingress qdisc
> 
> This worked for me, but I don't know about complex setups.
> 
> > 2) Dont compile support and add/delete ingress qdisc
> 
> tc gives an error (a bit like !CONFIG_NET_SCH_INGRESS)
> 
> # tc qdisc add dev eth0 ingress
> RTNETLINK answers: No such file or directory
> # tc -s -d qdisc show dev eth0
> qdisc mq 0: root 
>  Sent 636 bytes 10 pkt (dropped 0, overlimits 0 requeues 0) 
>  backlog 0b 0p requeues 0 
> 
> 
> > 3) Compile ingress as a module and add/delete ingress qdisc
> > 
> > 
> 
> Seems to work like 1)
> 
> > Other than that excellent work Eric. And you can add my
> > Acked/reviewed-by etc.
> > 
> > BTW, did i say i like your per-cpu stats stuff? It applies nicely to
> > qdiscs, actions etc ;->
> 
> I took a look at ifb as suggested by Stephen but could not see trivial
> changes (LLTX or per-cpu stats), since I am afraid a central lock is
> needed. And qdiscs are the same: stats updates are mostly free, as we
> already dirtied the cache line for the lock.
> 
> Thanks Jamal !
> 
> Here is the V2, with two #ifdef removed.
> 
> 
> [PATCH net-next V2] net: dynamic ingress_queue allocation
> 
> ingress being not used very much, and net_device->ingress_queue being
> quite a big object (128 or 256 bytes), use a dynamic allocation if
> needed (tc qdisc add dev eth0 ingress ...)
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@...il.com>
> ---
>  include/linux/netdevice.h |   11 ++++++--
>  net/core/dev.c            |   48 +++++++++++++++++++++++++++---------
>  net/sched/sch_api.c       |   40 ++++++++++++++++++++----------
>  net/sched/sch_generic.c   |   36 +++++++++++++++------------
>  4 files changed, 92 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index ceed347..4f86009 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -986,8 +986,7 @@ struct net_device {
>  	rx_handler_func_t	*rx_handler;
>  	void			*rx_handler_data;
>  
> -	struct netdev_queue	ingress_queue; /* use two cache lines */
> -
> +	struct netdev_queue	*ingress_queue;
>  /*
>   * Cache lines mostly used on transmit path
>   */
> @@ -1115,6 +1114,14 @@ static inline void netdev_for_each_tx_queue(struct net_device *dev,
>  		f(dev, &dev->_tx[i], arg);
>  }
>  
> +
> +static inline struct netdev_queue *dev_ingress_queue(struct net_device *dev)
> +{
> +	return dev->ingress_queue;
> +}
> +
> +extern struct netdev_queue *dev_ingress_queue_create(struct net_device *dev);
> +
>  /*
>   * Net namespace inlines
>   */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index a313bab..e3bb8c9 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2702,11 +2702,10 @@ EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
>   * the ingress scheduler, you just cant add policies on ingress.
>   *
>   */
> -static int ing_filter(struct sk_buff *skb)
> +static int ing_filter(struct sk_buff *skb, struct netdev_queue *rxq)
>  {
>  	struct net_device *dev = skb->dev;
>  	u32 ttl = G_TC_RTTL(skb->tc_verd);
> -	struct netdev_queue *rxq;
>  	int result = TC_ACT_OK;
>  	struct Qdisc *q;
>  
> @@ -2720,8 +2719,6 @@ static int ing_filter(struct sk_buff *skb)
>  	skb->tc_verd = SET_TC_RTTL(skb->tc_verd, ttl);
>  	skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_INGRESS);
>  
> -	rxq = &dev->ingress_queue;
> -
>  	q = rxq->qdisc;
>  	if (q != &noop_qdisc) {
>  		spin_lock(qdisc_lock(q));
> @@ -2737,7 +2734,9 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
>  					 struct packet_type **pt_prev,
>  					 int *ret, struct net_device *orig_dev)
>  {
> -	if (skb->dev->ingress_queue.qdisc == &noop_qdisc)
> +	struct netdev_queue *rxq = dev_ingress_queue(skb->dev);
> +
> +	if (!rxq || rxq->qdisc == &noop_qdisc)
>  		goto out;
>  
>  	if (*pt_prev) {
> @@ -2745,7 +2744,7 @@ static inline struct sk_buff *handle_ing(struct sk_buff *skb,
>  		*pt_prev = NULL;
>  	}
>  
> -	switch (ing_filter(skb)) {
> +	switch (ing_filter(skb, rxq)) {
>  	case TC_ACT_SHOT:
>  	case TC_ACT_STOLEN:
>  		kfree_skb(skb);
> @@ -4932,15 +4931,17 @@ static void __netdev_init_queue_locks_one(struct net_device *dev,
>  					  struct netdev_queue *dev_queue,
>  					  void *_unused)
>  {
> -	spin_lock_init(&dev_queue->_xmit_lock);
> -	netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
> -	dev_queue->xmit_lock_owner = -1;
> +	if (dev_queue) {
> +		spin_lock_init(&dev_queue->_xmit_lock);
> +		netdev_set_xmit_lockdep_class(&dev_queue->_xmit_lock, dev->type);
> +		dev_queue->xmit_lock_owner = -1;
> +	}
>  }
>  
>  static void netdev_init_queue_locks(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, __netdev_init_queue_locks_one, NULL);
> -	__netdev_init_queue_locks_one(dev, &dev->ingress_queue, NULL);
> +	__netdev_init_queue_locks_one(dev, dev_ingress_queue(dev), NULL);

Can dev_ingress_queue(dev) ever be non-NULL here?

>  }
>  
>  unsigned long netdev_fix_features(unsigned long features, const char *name)
> @@ -5447,16 +5448,37 @@ static void netdev_init_one_queue(struct net_device *dev,
>  				  struct netdev_queue *queue,
>  				  void *_unused)
>  {
> -	queue->dev = dev;
> +	if (queue)
> +		queue->dev = dev;
>  }
>  
>  static void netdev_init_queues(struct net_device *dev)
>  {
> -	netdev_init_one_queue(dev, &dev->ingress_queue, NULL);
> +	netdev_init_one_queue(dev, dev_ingress_queue(dev), NULL);

Can dev_ingress_queue(dev) ever be non-NULL here?

>  	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
>  	spin_lock_init(&dev->tx_global_lock);
>  }
>  
> +struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
> +{
> +	struct netdev_queue *queue = dev_ingress_queue(dev);
> +
> +#ifdef CONFIG_NET_CLS_ACT
> +	if (queue)
> +		return queue;
> +	queue = kzalloc(sizeof(*queue), GFP_KERNEL);
> +	if (!queue)
> +		return NULL;
> +	netdev_init_one_queue(dev, queue, NULL);
> +	__netdev_init_queue_locks_one(dev, queue, NULL);
> +	queue->qdisc = &noop_qdisc;
> +	queue->qdisc_sleeping = &noop_qdisc;
> +	smp_wmb();

Why don't we need smp_rmb() in handle_ing()?
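
To illustrate the concern, here is a minimal userspace sketch using C11
atomics instead of kernel barriers. It is not kernel code: demo_queue,
demo_dev, DEMO_NOOP and both functions are hypothetical stand-ins, and a
single writer is assumed. The release store plays the role of
smp_wmb() followed by the pointer assignment; the acquire load is the
reader-side pairing I am asking about. In the kernel this
publish/consume pattern is what rcu_assign_pointer() and
rcu_dereference() are meant for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NOOP 1	/* hypothetical stand-in for &noop_qdisc */

/* Hypothetical stand-in for struct netdev_queue. */
struct demo_queue {
	int qdisc;
	int qdisc_sleeping;
};

/* Hypothetical stand-in for struct net_device. */
struct demo_dev {
	_Atomic(struct demo_queue *) ingress_queue;
};

/* Writer side: fully initialise the queue, then publish the pointer.
 * Assumes callers are serialised (only one writer at a time). */
struct demo_queue *demo_ingress_queue_create(struct demo_dev *dev)
{
	struct demo_queue *q =
		atomic_load_explicit(&dev->ingress_queue, memory_order_acquire);

	if (q)
		return q;
	q = calloc(1, sizeof(*q));
	if (!q)
		return NULL;
	q->qdisc = DEMO_NOOP;
	q->qdisc_sleeping = DEMO_NOOP;
	/* Release: the initialising stores above become visible before
	 * the pointer does. */
	atomic_store_explicit(&dev->ingress_queue, q, memory_order_release);
	return q;
}

/* Reader side: returns 1 when ingress processing can be skipped. */
int demo_ingress_idle(struct demo_dev *dev)
{
	/* Acquire pairs with the release store in the writer, so a
	 * non-NULL pointer always refers to an initialised queue. */
	struct demo_queue *q =
		atomic_load_explicit(&dev->ingress_queue, memory_order_acquire);

	assert(!q || q->qdisc != 0);	/* never observe a half-built queue */
	return !q || q->qdisc == DEMO_NOOP;
}
```

Without the reader-side ordering, a reader on a weakly ordered CPU could
in principle see the new pointer before the initialised fields.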

> +	dev->ingress_queue = queue;
> +#endif
> +	return queue;
> +}
> +
>  /**
>   *	alloc_netdev_mq - allocate network device
>   *	@sizeof_priv:	size of private data to allocate space for
> @@ -5559,6 +5581,8 @@ void free_netdev(struct net_device *dev)
>  
>  	kfree(dev->_tx);
>  
> +	kfree(dev_ingress_queue(dev));
> +
>  	/* Flush device addresses */
>  	dev_addr_flush(dev);
>  
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index b802078..8635110 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -240,7 +240,10 @@ struct Qdisc *qdisc_lookup(struct net_device *dev, u32 handle)
>  	if (q)
>  		goto out;
>  
> -	q = qdisc_match_from_root(dev->ingress_queue.qdisc_sleeping, handle);
> +	if (!dev_ingress_queue(dev))
> +		goto out;
> +	q = qdisc_match_from_root(dev_ingress_queue(dev)->qdisc_sleeping,
> +				  handle);

I'd prefer:
 +	if (dev_ingress_queue(dev))
 +		q = qdisc_match_from_root(dev_ingress_queue(dev)->qdisc_sleeping,
 +					  handle);

>  out:
>  	return q;
>  }
> @@ -690,6 +693,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		    (new && new->flags & TCQ_F_INGRESS)) {
>  			num_q = 1;
>  			ingress = 1;
> +			if (!dev_ingress_queue(dev))
> +				return -ENOENT;

Is this test really needed here?

>  		}
>  
>  		if (dev->flags & IFF_UP)
> @@ -701,7 +706,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>  		}
>  
>  		for (i = 0; i < num_q; i++) {
> -			struct netdev_queue *dev_queue = &dev->ingress_queue;
> +			struct netdev_queue *dev_queue = dev_ingress_queue(dev);
>  
>  			if (!ingress)
>  				dev_queue = netdev_get_tx_queue(dev, i);
> @@ -979,7 +984,8 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
>  					return -ENOENT;
>  				q = qdisc_leaf(p, clid);
>  			} else { /* ingress */
> -				q = dev->ingress_queue.qdisc_sleeping;
> +				if (dev_ingress_queue(dev))
> +					q = dev_ingress_queue(dev)->qdisc_sleeping;
>  			}
>  		} else {
>  			q = dev->qdisc;
> @@ -1044,7 +1050,8 @@ replay:
>  					return -ENOENT;
>  				q = qdisc_leaf(p, clid);
>  			} else { /*ingress */
> -				q = dev->ingress_queue.qdisc_sleeping;
> +				if (dev_ingress_queue_create(dev))
> +					q = dev_ingress_queue(dev)->qdisc_sleeping;

I wonder if calling dev_ingress_queue_create() just before qdisc_create()
(and doing only the test here) wouldn't be more readable.

>  			}
>  		} else {
>  			q = dev->qdisc;
> @@ -1123,11 +1130,14 @@ replay:
>  create_n_graft:
>  	if (!(n->nlmsg_flags&NLM_F_CREATE))
>  		return -ENOENT;
> -	if (clid == TC_H_INGRESS)
> -		q = qdisc_create(dev, &dev->ingress_queue, p,
> -				 tcm->tcm_parent, tcm->tcm_parent,
> -				 tca, &err);
> -	else {
> +	if (clid == TC_H_INGRESS) {
> +		if (dev_ingress_queue(dev))
> +			q = qdisc_create(dev, dev_ingress_queue(dev), p,
> +					 tcm->tcm_parent, tcm->tcm_parent,
> +					 tca, &err);
> +		else
> +			err = -ENOENT;
> +	} else {
>  		struct netdev_queue *dev_queue;
>  
>  		if (p && p->ops->cl_ops && p->ops->cl_ops->select_queue)
> @@ -1304,8 +1314,10 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
>  		if (tc_dump_qdisc_root(dev->qdisc, skb, cb, &q_idx, s_q_idx) < 0)
>  			goto done;
>  
> -		dev_queue = &dev->ingress_queue;
> -		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb, &q_idx, s_q_idx) < 0)
> +		dev_queue = dev_ingress_queue(dev);
> +		if (dev_queue &&
> +		    tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb,
> +				       &q_idx, s_q_idx) < 0)
>  			goto done;
>  
>  cont:
> @@ -1595,8 +1607,10 @@ static int tc_dump_tclass(struct sk_buff *skb, struct netlink_callback *cb)
>  	if (tc_dump_tclass_root(dev->qdisc, skb, tcm, cb, &t, s_t) < 0)
>  		goto done;
>  
> -	dev_queue = &dev->ingress_queue;
> -	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t) < 0)
> +	dev_queue = dev_ingress_queue(dev);
> +	if (dev_queue &&
> +	    tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb,
> +				&t, s_t) < 0)
>  		goto done;
>  
>  done:
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 545278a..c42dec5 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -721,16 +721,18 @@ static void transition_one_qdisc(struct net_device *dev,
>  				 struct netdev_queue *dev_queue,
>  				 void *_need_watchdog)
>  {
> -	struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
> -	int *need_watchdog_p = _need_watchdog;
> +	if (dev_queue) {
> +		struct Qdisc *new_qdisc = dev_queue->qdisc_sleeping;
> +		int *need_watchdog_p = _need_watchdog;
>  
> -	if (!(new_qdisc->flags & TCQ_F_BUILTIN))
> -		clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
> +		if (!(new_qdisc->flags & TCQ_F_BUILTIN))
> +			clear_bit(__QDISC_STATE_DEACTIVATED, &new_qdisc->state);
>  
> -	rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
> -	if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
> -		dev_queue->trans_start = 0;
> -		*need_watchdog_p = 1;
> +		rcu_assign_pointer(dev_queue->qdisc, new_qdisc);
> +		if (need_watchdog_p && new_qdisc != &noqueue_qdisc) {
> +			dev_queue->trans_start = 0;
> +			*need_watchdog_p = 1;
> +		}
>  	}
>  }
>  
> @@ -753,7 +755,7 @@ void dev_activate(struct net_device *dev)
>  
>  	need_watchdog = 0;
>  	netdev_for_each_tx_queue(dev, transition_one_qdisc, &need_watchdog);
> -	transition_one_qdisc(dev, &dev->ingress_queue, NULL);
> +	transition_one_qdisc(dev, dev_ingress_queue(dev), NULL);

I'd prefer here and similarly later:

 +	if (dev_ingress_queue(dev))
 +		transition_one_qdisc(dev, dev_ingress_queue(dev), NULL);

to show NULL dev_queue is only legal in this one case.
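
A small standalone sketch of that style, with hypothetical demo_* names
that are not the kernel's: the per-queue helper requires a valid queue,
and the one caller that deals with the optional ingress queue does the
NULL test itself.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_TX 2	/* hypothetical fixed number of tx queues */

/* Hypothetical stand-in for struct netdev_queue. */
struct demo_queue {
	int active;
};

/* Hypothetical stand-in for struct net_device. */
struct demo_dev {
	struct demo_queue tx[DEMO_NUM_TX];
	struct demo_queue *ingress_queue;	/* NULL until allocated */
};

/* The helper has no NULL test: every caller must pass a valid queue. */
static void demo_transition_one(struct demo_queue *q)
{
	assert(q != NULL);
	q->active = 1;
}

void demo_activate(struct demo_dev *dev)
{
	for (size_t i = 0; i < DEMO_NUM_TX; i++)
		demo_transition_one(&dev->tx[i]);

	/* The ingress queue is the only one allowed to be missing, and
	 * the test at the call site makes that explicit. */
	if (dev->ingress_queue)
		demo_transition_one(dev->ingress_queue);
}
```

This keeps the tx path free of a test that can never fail there.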

Cheers,
Jarek P.

>  
>  	if (need_watchdog) {
>  		dev->trans_start = jiffies;
> @@ -768,7 +770,7 @@ static void dev_deactivate_queue(struct net_device *dev,
>  	struct Qdisc *qdisc_default = _qdisc_default;
>  	struct Qdisc *qdisc;
>  
> -	qdisc = dev_queue->qdisc;
> +	qdisc = dev_queue ? dev_queue->qdisc : NULL;
>  	if (qdisc) {
>  		spin_lock_bh(qdisc_lock(qdisc));
>  
> @@ -812,7 +814,7 @@ static bool some_qdisc_is_busy(struct net_device *dev)
>  void dev_deactivate(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, dev_deactivate_queue, &noop_qdisc);
> -	dev_deactivate_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	dev_deactivate_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  
>  	dev_watchdog_down(dev);
>  
> @@ -830,15 +832,17 @@ static void dev_init_scheduler_queue(struct net_device *dev,
>  {
>  	struct Qdisc *qdisc = _qdisc;
>  
> -	dev_queue->qdisc = qdisc;
> -	dev_queue->qdisc_sleeping = qdisc;
> +	if (dev_queue) {
> +		dev_queue->qdisc = qdisc;
> +		dev_queue->qdisc_sleeping = qdisc;
> +	}
>  }
>  
>  void dev_init_scheduler(struct net_device *dev)
>  {
>  	dev->qdisc = &noop_qdisc;
>  	netdev_for_each_tx_queue(dev, dev_init_scheduler_queue, &noop_qdisc);
> -	dev_init_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	dev_init_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  
>  	setup_timer(&dev->watchdog_timer, dev_watchdog, (unsigned long)dev);
>  }
> @@ -847,7 +851,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
>  				     struct netdev_queue *dev_queue,
>  				     void *_qdisc_default)
>  {
> -	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
> +	struct Qdisc *qdisc = dev_queue ? dev_queue->qdisc_sleeping : NULL;
>  	struct Qdisc *qdisc_default = _qdisc_default;
>  
>  	if (qdisc) {
> @@ -861,7 +865,7 @@ static void shutdown_scheduler_queue(struct net_device *dev,
>  void dev_shutdown(struct net_device *dev)
>  {
>  	netdev_for_each_tx_queue(dev, shutdown_scheduler_queue, &noop_qdisc);
> -	shutdown_scheduler_queue(dev, &dev->ingress_queue, &noop_qdisc);
> +	shutdown_scheduler_queue(dev, dev_ingress_queue(dev), &noop_qdisc);
>  	qdisc_destroy(dev->qdisc);
>  	dev->qdisc = &noop_qdisc;
>  
> 
> 