Date:   Wed, 11 Apr 2018 23:47:50 +0000
From:   Saeed Mahameed <saeedm@...lanox.com>
To:     "eric.dumazet@...il.com" <eric.dumazet@...il.com>,
        "davem@...emloft.net" <davem@...emloft.net>
CC:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [RFC net-next 1/2] net: net-procfs: Reduce rcu lock critical
 section

On Wed, 2018-04-11 at 15:30 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 11:59 AM, Saeed Mahameed wrote:
> > On Tue, 2018-04-10 at 13:35 -0700, Eric Dumazet wrote:
> > > 
> > > On 04/10/2018 10:16 AM, David Miller wrote:
> > > > 
> > > > The tradeoff here is that now you are doing two unnecessary
> > > > atomic operations per stats dump.
> > > > 
> > > > That is what the RCU lock allows us to avoid.
> > > > 
> > > 
> > > dev_hold() and dev_put() are actually per-cpu increments and
> > > decrements, pretty cheap these days.
> > > 
> > 
> > Yes, I am adding only 2 CPU instructions here.
> > I think the trade-off here is too small, and the price of finally
> > having get_stats64 called from non-atomic context is really worth
> > it.
> 
> Oh... but you have not mentioned this reason in your changelog.
> 

From the commit message:

"This is really greedy and demanding from device drivers since
ndo_get_stats64 called from dev_seq_show while the rcu lock is held"

Sorry if this wasn't clear enough; I will fix it up if this goes through.

> What about bonding stats?
> 
> By sending partial patches like that, others have to take care of
> the details and this is not really acceptable.
> 

This is an RFC just to show the approach. If the approach is
acceptable, of course I will provide a full series that handles all the
other places; the change should be the same. I already have 2 other
patches to address OVS and netlink stats, I just didn't want to waste
your time on small details like netlink messages.

> > 
> > It looks really odd to me that the device chain locks are held for
> > such long periods while we already have the means to avoid this.
> > The same goes for rtnl_lock; the same trick can work for many use
> > cases and many ndos. We are just being overprotective for no
> > reason.
> > 
> > 
> > > The problem here is that any preemption of the process holding a
> > > device reference might trigger warnings in device unregister.
> > > 
> > 
> > This is true for any other place dev_hold is used.
> > As explained in the commit message, dev_hold is taken for a very
> > brief moment before calling the stats ndo and released directly
> > after.
> 
> Not really.
> 
> Other places usually have notifiers to remove the refcount when
> needed.
> 

Other places hold the netdev for its whole lifetime; they don't know
when to release it, which is why they need notifiers.

In this patch the approach is:

hold
call ndo
put

A notifier is not needed in this case.
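
To illustrate, here is a rough sketch of that pattern as I read this
RFC (my paraphrase, not the exact hunk from the patch; the helper name
is made up):

/* Pin the device with a reference, leave the RCU critical section,
 * call the stats ndo from a sleepable context, then drop the
 * reference and go back under RCU for the device walk.
 */
static void seq_show_one_dev(struct seq_file *seq, struct net_device *dev)
{
	struct rtnl_link_stats64 stats;

	dev_hold(dev);		/* per-cpu increment, cheap */
	rcu_read_unlock();	/* the ndo may now sleep */

	dev_get_stats(dev, &stats);

	dev_put(dev);		/* per-cpu decrement */
	rcu_read_lock();	/* resume the RCU-protected walk */

	/* ... format 'stats' into 'seq' ... */
}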

> We worked quite hard in the past to remove extra dev_hold()
> (before we finally converted it to a percpu counter).
> 
> > 
> > Looking at
> > 
> > netdev_wait_allrefs(...)
> > [...]
> > 	msleep(250);
> > 
> > 	refcnt = netdev_refcnt_read(dev);
> > 
> > 	if (time_after(jiffies, warning_time + 10 * HZ)) {
> > 		pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
> > 			 dev->name, refcnt);
> > 		warning_time = jiffies;
> > 	}
> > 
> > The holder will get enough time to release the netdev way before
> > the warning is triggered.
> > 
> > The warning is triggered only if someone holds the dev for more
> > than 10 seconds, and it is impossible for the stats ndo to take
> > that long; in fact I just did a quick measurement, and on average
> > the get_stats64 ndo takes 0.5us!
> 
> Average is nice, but what about max time?
> 

Well, if we allow devices to access HW counters via FW command
interfaces in ndo_get_stats, then, measuring mlx5 where we query up to
5 HW registers, it could take 100us. That is still way smaller than 10
sec :) and it is really a nice rate for fetching HW stats on demand.
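
For reference, a quick measurement like the one above needs nothing
fancy; a hypothetical timing helper (not part of the patch) could be:

/* Time a single dev_get_stats() call, in nanoseconds. Illustration
 * only; assumes <linux/ktime.h> and <linux/netdevice.h>.
 */
static u64 get_stats_latency_ns(struct net_device *dev)
{
	struct rtnl_link_stats64 stats;
	u64 start = ktime_get_ns();

	dev_get_stats(dev, &stats);

	return ktime_get_ns() - start;
}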

> Sleeping more than 10 seconds to satisfy a GFP_KERNEL memory
> allocation can definitely happen in the real world, or simply if you
> get preempted by some RT/high-prio tasks.
> 

The same issue can occur without this patch if an IRQ is triggered
while under the rcu lock.
And an RT/high-prio task shouldn't run for more than 10 sec.

If a GFP_KERNEL memory allocation takes more than 10 sec, you will
already be getting tons of OOM warnings. A netdev warning, in the case
where the netdev really is being unregistered and ndo_get_stats was
preempted for more than 10 seconds, will be the least of your problems.

> Just say no to 'might sleep in ndo_get_stats()', and you will save a
> lot of issues.

We are just being overprotective here.

Just saying no to 'might sleep in ndo_get_stats()' is by itself
creating a lot of issues: many drivers out there have a background
thread running N times a second just to cache HW stats, which is really
silly and CPU-consuming for no reason.

The rate of ndo_get_stats calls should match the rate at which the
driver can actually provide fresh stats; any background thread is just
a waste of CPU. Counters should be fetched from HW on demand rather
than periodically for no reason.
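
As a rough sketch of the on-demand model (all foo_* names here are
hypothetical, for illustration only):

/* Hypothetical driver: query the FW for fresh counters inside the
 * ndo itself, which may sleep, instead of refreshing a cache from a
 * periodic background thread.
 */
static void foo_get_stats64(struct net_device *netdev,
			    struct rtnl_link_stats64 *stats)
{
	struct foo_priv *priv = netdev_priv(netdev);

	/* FW command interface; may sleep for ~100us */
	foo_fw_query_counters(priv, stats);
}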

The same goes for the set_rx_mode ndo.

Bottom line, it looks like the only thing RCU locking is needed for
today is synchronizing access to the netdev ndo with the device chain
in question (namespace/OVS/bonding/etc.).

There are many places where dev_get_stats is called from non-atomic
context, where the caller knows it is safe to access the netdev ndo.

Example:
net/bonding/bond_main.c: bond_enslave(..)

/* initialize slave stats */
dev_get_stats(new_slave->dev, &new_slave->slave_stats);
