netdev - Re: [PATCH net] net: rps: fix cpu unplug


Message-ID: <1421367274.11734.105.camel@edumazet-glaptop2.roam.corp.google.com>
Date:	Thu, 15 Jan 2015 16:14:34 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	subashab@...eaurora.org
Cc:	Prasad Sodagudi <psodagud@...eaurora.org>, netdev@...r.kernel.org,
	Tom Herbert <therbert@...gle.com>
Subject: Re: [PATCH net] net: rps: fix cpu unplug

On Thu, 2015-01-15 at 22:29 +0000, subashab@...eaurora.org wrote:
> Thanks for the patch. I shall try it out and provide feedback soon.
> But we think the race condition issue is different. The crash was observed
> in the process_queue.


> 
> In the event of a CPU hotplug, the NAPI poll_list is copied over from the
> offline CPU to the CPU on which dev_cpu_callback() was called. These
> operations happen in dev_cpu_callback() in the context of the notifier
> chain from the hotplug framework. Also in the same hotplug notifier context
> (dev_cpu_callback) the input_pkt_queue and process_queue of the offline
> CPU are dequeued and sent up the network stack, and this is where I think
> the race/problem is.
> 
> Context1: The online CPU starts processing the poll_list from
> net_rx_action since a softIRQ was raised in dev_cpu_callback().
> process_backlog() draining the process queue
> 
> Context2: hotplug notifier dev_cpu_callback() draining the queues and
> calling netif_rx().
> 
> from dev_cpu_callback()
> 	/* Process offline CPU's input_pkt_queue */
> 	while ((skb = __skb_dequeue(&oldsd->process_queue))) {
> 		netif_rx(skb);
> 		input_queue_head_incr(oldsd);
> 	}
> 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) {
> 		netif_rx(skb);
> 		input_queue_head_incr(oldsd);
> 	}
> 
> Is this de-queuing (the above code snippet from dev_cpu_callback())
> actually necessary since the poll_list should already have the backlog
> napi struct of the old CPU? In this case when process_backlog()
> actually runs it should drain these two queues of older CPU.
> Let me know your thoughts.


input_pkt_queue and process_queue have nothing to do with NAPI
poll_list : They store skbs.

dev_cpu_callback() is called when the cpu we are offlining is no longer
running. No interrupts are serviced by this offline cpu either.

You have the absolute guarantee that no one is manipulating process_queue
at the same time as you.
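
To illustrate that guarantee, here is a minimal user-space sketch, not
kernel code. The types pkt, pkt_queue and cpu_backlog and the helpers
pkt_enqueue(), pkt_dequeue() and drain_offline_cpu() are hypothetical
stand-ins for sk_buff, sk_buff_head, softnet_data and the loops quoted
above. Appending to the current cpu's input_pkt_queue is a simplification
of what netif_rx() does. The point is only that the drain needs no lock
as long as the old cpu is really offline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sk_buff and struct sk_buff_head. */
struct pkt {
	struct pkt *next;
	int id;
};

struct pkt_queue {
	struct pkt *head;
	struct pkt *tail;
	unsigned int len;
};

/* Hypothetical per-cpu state, loosely modelled on struct softnet_data. */
struct cpu_backlog {
	int online;
	struct pkt_queue input_pkt_queue;
	struct pkt_queue process_queue;
};

/* Append to the tail without locking (in the spirit of the __skb_* helpers). */
static void pkt_enqueue(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
	q->len++;
}

/* Remove from the head without locking; returns NULL when the queue is empty. */
static struct pkt *pkt_dequeue(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (!p)
		return NULL;
	q->head = p->next;
	if (!q->head)
		q->tail = NULL;
	p->next = NULL;
	q->len--;
	return p;
}

/*
 * Drain both queues of an offline cpu into the current cpu's input queue.
 * No lock is taken: this is only safe because oldsd's owner is not running,
 * which the first assert checks in this sketch.
 */
static unsigned int drain_offline_cpu(struct cpu_backlog *oldsd,
				      struct cpu_backlog *cur)
{
	struct pkt *p;
	unsigned int moved = 0;

	assert(!oldsd->online);
	assert(cur->online);

	while ((p = pkt_dequeue(&oldsd->process_queue))) {
		pkt_enqueue(&cur->input_pkt_queue, p);
		moved++;
	}
	while ((p = pkt_dequeue(&oldsd->input_pkt_queue))) {
		pkt_enqueue(&cur->input_pkt_queue, p);
		moved++;
	}
	return moved;
}
```

If some other cpu could run process_backlog() on oldsd at the same time,
pkt_dequeue() here would race with it, which is the situation described
next.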

It looks like you found another issue, not related to RPS, but due to
the fact that commit 264524d5e5195f6e
("net: cpu offline cause napi stall") did not exclude the percpu
backlog.

process_backlog() MUST be called by the owner cpu. Otherwise we would
need to add locking everywhere, as you did, and this is simply insane.
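
As a rough sketch of what "exclude the percpu backlog" could mean when the
poll_list is moved to another cpu (user-space C, hypothetical names, and
not necessarily what V2 will look like): napi_sketch, backlog_poll(),
driver_poll() and migrate_poll_list() are made up for illustration. Device
entries are handed to the new cpu, while the old cpu's backlog entry is
only unscheduled, so that no other cpu ever polls it.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct napi_struct. */
struct napi_sketch {
	int (*poll)(struct napi_sketch *napi, int budget);
	int scheduled;
};

/* Stand-in for process_backlog(): must only run on the owner cpu. */
static int backlog_poll(struct napi_sketch *napi, int budget)
{
	(void)napi;
	(void)budget;
	return 0;
}

/* Stand-in for a device driver's poll function. */
static int driver_poll(struct napi_sketch *napi, int budget)
{
	(void)napi;
	return budget;
}

/*
 * Move entries from the offline cpu's poll list to the new cpu's list.
 * The backlog entry is not moved: it is just marked unscheduled and
 * dropped from the old list. Returns the number of entries moved.
 */
static size_t migrate_poll_list(struct napi_sketch **old_list, size_t n_old,
				struct napi_sketch **new_list, size_t cap)
{
	size_t moved = 0;

	for (size_t i = 0; i < n_old; i++) {
		struct napi_sketch *napi = old_list[i];

		if (napi->poll == backlog_poll) {
			napi->scheduled = 0;
			old_list[i] = NULL;
			continue;
		}
		if (moved == cap)
			break;	/* no room: leave the rest on the old list */
		new_list[moved++] = napi;
		old_list[i] = NULL;
	}
	return moved;
}
```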

I'll send a V2

