netdev - Re: [PATCH 1/1] net: race condition when removing virtual net

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+HUmGih9akzhpRrb_0WEapi4jGcuSV8qO==QeRWHoqnxzFyng@mail.gmail.com>
Date:	Mon, 16 Sep 2013 13:30:50 -0700
From:	Francesco Ruggeri <fruggeri@...stanetworks.com>
To:	"Eric W. Biederman" <ebiederm@...ssion.com>
Cc:	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jiri Pirko <jiri@...nulli.us>,
	Alexander Duyck <alexander.h.duyck@...el.com>,
	Cong Wang <amwang@...hat.com>, netdev@...r.kernel.org
Subject: Re: [PATCH 1/1] net: race condition when removing virtual net_device

On Mon, Sep 16, 2013 at 3:45 AM, Eric W. Biederman
<ebiederm@...ssion.com> wrote:

> I believe the weird reordering of veth devices is solved or largely
> solved by:
>
> commit f45a5c267da35174e22cec955093a7513dc1623d

I looked at d0e2c55e and f45a5c26 in the past and I do not think they
are related to the problem I see, which I think is more related to the
infrastructure. See more below.

>
>> and this is the sequence after pushing the loopback_devs to the tail
>> of net_todo_list:
>>
>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>> unregister_netdevice_queue: lo (ns ffff8800379f8000)
>> unregister_netdevice_queue: v1 (ns ffff8800378c0b00)
>> unregister_netdevice_queue: v0 (ns ffff8800379f8000)
>> unregister_netdevice_queue: lo (ns ffff8800378c0b00)
>> unregister_netdevice_many: lo (ns ffff8800379f8000) v1 (ns
>> ffff8800378c0b00) v0 (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>> netdev_run_todo: v1 (ns ffff8800378c0b00) v0 (ns ffff8800379f8000) lo
>> (ns ffff8800379f8000) lo (ns ffff8800378c0b00)
>
> I am scratching my head.  That isn't what I would expect from putting
> loopback devices at the end of the list.

The way to read it is as follows:
unregister_netdevice_many gets the sequence lo0 v1 v0 lo1 (lo0 is the
occurrence of "lo" with the same namespace as v0, etc.).
As the interfaces are unlisted the loopback_devs are queued at the end
of net_todo_list, so when netdev_run_todo executes net_todo_list is v1
v0 lo0 lo1.

>
> Even more I am not comfortable with any situation where preserving the
> order in which the devices are unregistered is not sufficient to make
> things work correctly.  That quickly gets into fragile layering
> problems, and I don't know how anyone would be able to maintain that.
>

I agree with your general statement, but unfortunately the order in
which the interfaces are unregistered is not preserved even in the
current code, since dev_close_many already rearranges them based on
whether they are UP or not.

If I delete a namespace with only lo and v0 this is the order in which
they are released depending on their state. If lo is DOWN then it will
be processed before v0 in netdev_run_todo.

====== lo down, v0 down
unregister_netdevice_queue: v0 (ns ffff880037bb0b00)
unregister_netdevice_queue: lo (ns ffff880037bb0b00)
unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00)
netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00)

====== lo up, v0 down
unregister_netdevice_queue: v0 (ns ffff8800378a9600)
unregister_netdevice_queue: lo (ns ffff8800378a9600)
unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)
netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)

====== lo down, v0 up
unregister_netdevice_queue: v0 (ns ffff880037bb0b00)
unregister_netdevice_queue: lo (ns ffff880037bb0b00)
unregister_netdevice_many: v0 (ns ffff880037bb0b00) lo (ns ffff880037bb0b00)
netdev_run_todo: lo (ns ffff880037bb0b00) v0 (ns ffff880037bb0b00)

====== lo up, v0 up
unregister_netdevice_queue: v0 (ns ffff8800378a9600)
unregister_netdevice_queue: lo (ns ffff8800378a9600)
unregister_netdevice_many: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)
netdev_run_todo: v0 (ns ffff8800378a9600) lo (ns ffff8800378a9600)


> I still don't see a better solution than dev_change_net_namespace,
> for the network devices that have this problem.  Network devices are not
> supposed to be findable after the dance at the start of cleanup_net.
>

I think the problem we are seeing is an infrastructure one. If I can restate it:

1) When destroying a namespace all net_devices in it should be
disposed of before any non-device pernet subsystems exit. The existing
code tries to do this but it does not take into consideration
net_devices that may have been unlisted from the namespace by another
process which is still in the process of destroying them (specifically
in netdev_run_todo/netdev_wait_allrefs).

There is also another issue, separate but related (my diffs did not
try to address it):

2) dev_change_net_namespace does not have the equivalent of
netdev_wait_allrefs. If rebroadcasts are in fact needed when
destroying a net_device then the same should apply when the net_device
is moved out of a namespace. With existing code a net_device can be
moved to a different namespace before its refcount drops to 0. Any
later broadcasts will not trigger any action in the original namespace
where they should have occurred.

I suspect that this is not catastrophic but its effects could be
subtle. This is another reason why I would avoid using
dev_change_net_namespace as a driver level solution for 1) above,
besides the fact that we would have to apply it to all drivers
affected by this (veth, vlan, macvlan, maybe more).

> It might be possible to actually build asynchronous network device
> unregistration and opportunistick batching.  Aka cleanup network devices
> much like we clean up network namespaces now, and by funnelling them all
> into a single threaded work queue that would probably stop the races, as
> well as improve performance.  I don't have the energy to do that right
> now unfortunately :(
>

That would probably solve it, but as you suggest it will take a
considerable amount of work and code re-structuring.

Francesco
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html