netdev - Re: [RFC net-next PATCH] macvtap: Add packet capture support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <528D313D.5000505@redhat.com>
Date:	Wed, 20 Nov 2013 17:01:33 -0500
From:	Vlad Yasevich <vyasevic@...hat.com>
To:	"Michael S. Tsirkin" <mst@...hat.com>
CC:	netdev@...r.kernel.org
Subject: Re: [RFC net-next PATCH] macvtap: Add packet capture support

On 11/20/2013 04:12 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 20, 2013 at 03:35:22PM -0500, Vlad Yasevich wrote:
>> On 11/20/2013 03:30 PM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 20, 2013 at 03:19:01PM -0500, Vlad Yasevich wrote:
>>>> On 11/20/2013 03:06 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Nov 20, 2013 at 02:36:40PM -0500, Vlad Yasevich wrote:
>>>>>> On 11/20/2013 01:19 PM, Michael S. Tsirkin wrote:
>>>>>>> On Wed, Nov 20, 2013 at 01:04:09PM -0500, Vlad Yasevich wrote:
>>>>>>>> Currently it is impossible to capture traffic when using a macvtap
>>>>>>>> device.  The reason is that all capture handling is done either in
>>>>>>>> dev_hard_start_xmit() or in __netif_receive_skb_core().  Macvtap
>>>>>>>> currenlty doesn't use dev_hard_start_xmit(), and at the time the
>>>>>>>> packet traverses __netif_receive_skb_core, the skb->dev is set to
>>>>>>>> the lower-level device that doesn't end up matching macvtap.
>>>>>>>>
>>>>>>>> To solve the issue, use dev_hard_start_xmit() on the output path.
>>>>>>>> On the input path, it is toughter to solve since macvtap ends up
>>>>>>>> consuming the skb so there is nothing more left for
>>>>>>>> __netif_receive_skb_core() to do.
>>>>>
>>>>> Actually I thought I understand what you are saying here, but now I
>>>>> don't. bridge installs rx handler exactly in the same way.
>>>>> packet handlers seem to be called before the rx handlers so
>>>>> everything should just work.
>>>>>
>>>>> Is this about the bridge mode again?
>>>>
>>>> No.  It has to do with bridge submitting the frame back to the
>>>> network stack to forward it to the ports, but macvtap ends up
>>>> stealing it.
>>>
>>> Confused.
>>>
>>>         rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>         if (rx_handler) {
>>>                 if (pt_prev) {
>>>                         ret = deliver_skb(skb, pt_prev, orig_dev);
>>>                         pt_prev = NULL;
>>>                 }
>>>                 switch (rx_handler(&skb)) {
>>>
>>> 		....
>>>
>>> so packet handlers (including packet socket)
>>> seem to be invoked before rx handlers (including bridge and macvtap).
>>>
>>> What's the issue then?
>>>
>>> I guess I'm missing something obvious.
>>
>> In case of macvtap0@...0, before rx_handler is invoked skb->dev == eth0.
>> During the macvlan rx_handler, skb->dev = macvtap0, and the packet is
>> stolen and delivered to the socket, thus no capture.
> 
> Aha. I get what the issue is now. So tcpdump -i eth0 works, you want to make
> tcpdump -i macvtap0@...0 work too.
> 
>> In the case of the bridge (eth0, eth1), during the rx_handler
>> skb->dev = bridge.  Bridge calls netif_receive_skb() again, when in
>> promisc mode, and now packet is captured.
>>
>> -vlad
> 
> Well but not in non promisc mode correct?
> 
> I guess a closer model is tun, you can always
> do tcpdump on it and see whatever application sees.
> 
> I guess one other option is a new RX_HANDLER_DEV_CHANGED -
> same as consumed but invokes taps after we changed the skb dev
> 
>                 case RX_HANDLER_DEV_CHANGED:
> 			if (!list_empty(&ptype_all))
>                         	goto another_round;
>                         ret = NET_RX_SUCCESS;
> 	                kfree_skb(skb);
>                         goto unlock;
> 

This is going to have a problem in that by doing another_round, you not
only do the taps, but you also attempt to deliver locally if no
rx_handlers consume the skb.  With the current code, there will be
no rx_handlers on the macvtap device and you'll attempt to deliver
locally to the host.  Not what we want...

-vlad

> 
>>>
>>>
>>>
>>>> Also, bridge, if running in promisc mode will hand the frame up
>>>> (as if it received it).  See the IFF_PROMISC code in
>>>> br_handle_frame_finish().
>>>> So, if someone is capturing on the bridge itself, this lets it
>>>> see the packets even though they are not destined for the bridge
>>>> device.
>>>
>>> Yes, that's quite expected: same thing will happen with
>>> real hardware: set it to promisc, see lots of packets
>>> destined at others.
>>>
>>>>>
>>>>>>  A simple solution is to
>>>>>>>> pull the code that delivers things to the taps/captures into
>>>>>>>> its own function and allow macvtap to call it from its receive
>>>>>>>> routine.
>>>>>>>>
>>>>>>>> Signed-off-by: Vlad Yasevich <vyasevic@...hat.com>
>>>>>>>> ---
>>>>>>>> This is only an RFC.  I'd like to solicit comments on this simple
>>>>>>>> approach.
>>>>>>>
>>>>>>> I'm kind of worried about this. What worries me is that normally
>>>>>>> if you have a packet socket bound to all interfaces, what it shows is
>>>>>>> traffic to/from the box.
>>>>>>>
>>>>>>> This might be a bug for someone, but I suspect at this point this
>>>>>>> is part of the ABI.
>>>>>>>
>>>>>>> But macvtap bypasses most of the host networking stack,
>>>>>>> So I worry that suddenly showing these packets would be confusing.
>>>>>>
>>>>>> Is it really different from using bridge and tap?  If someone
>>>>>> does 'tcpdump -i any', they will see packets sent by the guest with
>>>>>> bridge+tap.
>>>>>>  It makes sense to do that same thing for macvtap, no?
>>>>>
>>>>> I was going by your comments not the code.
>>>>> Assuming we never showed macvtap traffic this might
>>>>> be part of ABI.
>>>>
>>>> BTW, if we end up doing it with a new rx_handler, it will end up
>>>> showing exactly the same traffic is this patch does because the
>>>> common path in __netif_receive_skb_core() will run and deliver
>>>> to all registered eligible entries in ptype_all.
>>>>
>>>> -vlad
>>>
>>> Dave here thinks it's not a problem, I trust him ...
>>>
>>>>>
>>>>>>>
>>>>>>> Assuming we want to show this traffic, I think it's preferable to
>>>>>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>>>> +		if (ptype->dev == skb->dev) {
>>>>>>> so you have to bind to macvtap explicitly to see the traffic.
>>>>>>>
>>>>>>>
>>>>>>> Of course when you start binding to specific macvtaps
>>>>>>> it doesn't scale well because we'll have a long list
>>>>>>> on ptype_all suddenly.
>>>>>>
>>>>>> How likely is that really?  ptype_all doesn't scale as it.
>>>>>> Any time you don't set the protocol field, the entry goes
>>>>>> into ptype_all.  Macvtap doesn't change that.  You can
>>>>>> create a long list in ptype all if you have lots of guests and
>>>>>> you want to snoop on each guest separately thus binding to tap
>>>>>> device.
>>>>>>
>>>>>> -vlad
>>>>>>
>>>>>>> This might mean we need to keep this list per-device ...
>>>>>>>
>>>>>>>> A more complicated solution would have been to give
>>>>>>>> macvtap its own rx_handler and return RX_HANDLER_ANOTHER during
>>>>>>>> receive operation to make the packet go through another round of
>>>>>>>> capturing and hit the macvtap rx_handler.  I thought this would
>>>>>>>> hurt performance too much for no real gain.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> -vlad
>>>>>>>>
>>>>>>>>  drivers/net/macvtap.c     |  4 +++-
>>>>>>>>  include/linux/netdevice.h |  2 ++
>>>>>>>>  net/core/dev.c            | 33 ++++++++++++++++++++++++++-------
>>>>>>>>  3 files changed, 31 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>>>>>>>> index dc76670..0ed8fae 100644
>>>>>>>> --- a/drivers/net/macvtap.c
>>>>>>>> +++ b/drivers/net/macvtap.c
>>>>>>>> @@ -334,6 +334,7 @@ drop:
>>>>>>>>   */
>>>>>>>>  static int macvtap_receive(struct sk_buff *skb)
>>>>>>>>  {
>>>>>>>> +	netif_receive_skb_taps(skb, true);
>>>>>>>>  	skb_push(skb, ETH_HLEN);
>>>>>>>>  	return macvtap_forward(skb->dev, skb);
>>>>>>>>  }
>>>>>>>> @@ -727,8 +728,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
>>>>>>>>  		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
>>>>>>>>  	}
>>>>>>>>  	if (vlan) {
>>>>>>>> +		skb->dev = vlan->dev;
>>>>>>>>  		local_bh_disable();
>>>>>>>> -		macvlan_start_xmit(skb, vlan->dev);
>>>>>>>> +		dev_hard_start_xmit(skb, vlan->dev, NULL, NULL);
>>>>>>>>  		local_bh_enable();
>>>>>>>>  	} else {
>>>>>>>>  		kfree_skb(skb);
>>>>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>>>>>> index 8b3de7c..84880eb 100644
>>>>>>>> --- a/include/linux/netdevice.h
>>>>>>>> +++ b/include/linux/netdevice.h
>>>>>>>> @@ -2383,6 +2383,8 @@ void dev_kfree_skb_any(struct sk_buff *skb);
>>>>>>>>  int netif_rx(struct sk_buff *skb);
>>>>>>>>  int netif_rx_ni(struct sk_buff *skb);
>>>>>>>>  int netif_receive_skb(struct sk_buff *skb);
>>>>>>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb,
>>>>>>>> +					   bool last_deliver);
>>>>>>>>  gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
>>>>>>>>  void napi_gro_flush(struct napi_struct *napi, bool flush_old);
>>>>>>>>  struct sk_buff *napi_get_frags(struct napi_struct *napi);
>>>>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>>>>> index da9c5e1..50f0ac4 100644
>>>>>>>> --- a/net/core/dev.c
>>>>>>>> +++ b/net/core/dev.c
>>>>>>>> @@ -3484,6 +3484,31 @@ static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
>>>>>>>>  	}
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb,
>>>>>>>> +					   bool last_deliver)
>>>>>>>> +{
>>>>>>>> +	struct packet_type *ptype, *pt_prev = NULL;
>>>>>>>> +	struct net_device *orig_dev;
>>>>>>>> +
>>>>>>>> +	if (list_empty(&ptype_all))
>>>>>>>> +		return NULL;
>>>>>>>> +
>>>>>>>> +	orig_dev = skb->dev;
>>>>>>>> +	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>>>>>> +		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>>>>> +			if (pt_prev)
>>>>>>>> +				deliver_skb(skb, pt_prev, orig_dev);
>>>>>>>> +			pt_prev = ptype;
>>>>>>>> +		}
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (last_deliver && pt_prev)
>>>>>>>> +		deliver_skb(skb, pt_prev, orig_dev);
>>>>>>>> +
>>>>>>>> +	return pt_prev;
>>>>>>>> +}
>>>>>>>> +EXPORT_SYMBOL_GPL(netif_receive_skb_taps);
>>>>>>>> +
>>>>>>>>  static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
>>>>>>>>  {
>>>>>>>>  	struct packet_type *ptype, *pt_prev;
>>>>>>>> @@ -3535,13 +3560,7 @@ another_round:
>>>>>>>>  	if (pfmemalloc)
>>>>>>>>  		goto skip_taps;
>>>>>>>>  
>>>>>>>> -	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>>>>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>>>>> -			if (pt_prev)
>>>>>>>> -				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>>>>> -			pt_prev = ptype;
>>>>>>>> -		}
>>>>>>>> -	}
>>>>>>>> +	pt_prev = netif_receive_skb_taps(skb, false);
>>>>>>>>  
>>>>>>>>  skip_taps:
>>>>>>>>  #ifdef CONFIG_NET_CLS_ACT
>>>>>>>> -- 
>>>>>>>> 1.8.4.2

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html