lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 3 Oct 2012 11:24:13 +0200 (CEST)
From:	Jiri Kosina <jkosina@...e.cz>
To:	"Srivatsa S. Bhat" <srivatsa.bhat@...ux.vnet.ibm.com>
Cc:	paulmck@...ux.vnet.ibm.com,
	"Paul E. McKenney" <paul.mckenney@...aro.org>,
	Josh Triplett <josh@...htriplett.org>,
	linux-kernel@...r.kernel.org
Subject: Re: Lockdep complains about commit 1331e7a1bb ("rcu: Remove
 _rcu_barrier() dependency on __stop_machine()")

On Wed, 3 Oct 2012, Srivatsa S. Bhat wrote:

> >> static void cpu_hotplug_begin(void)
> >> {
> >>         cpu_hotplug.active_writer = current;
> >>
> >>         for (;;) {
> >>                 mutex_lock(&cpu_hotplug.lock);
> >>                 if (likely(!cpu_hotplug.refcount))   <================ This one!
> >>                         break;
> >>                 __set_current_state(TASK_UNINTERRUPTIBLE);
> >>                 mutex_unlock(&cpu_hotplug.lock);
> >>                 schedule();
> >>         }   
> >> }
> > 
> > I acutally just came to the same conclusion (7 hours of sleep later, the 
> > mind indeed seems to be brighter ... what a poet I am).
> > 
> > Lockdep doesn't know about this semantics of cpu_hotplug_begin(), and 
> > therefore gets confused by the fact that mutual exclusion is actually 
> > achieved through the refcount instead of mutex (and the same apparently 
> > happened to me).
> 
> No, that's not the problem. Lockdep is fine. The calltrace clearly shows that
> our refcounting has messed up somewhere. As a result, we really *are* running
> a hotplug-reader and a hotplug-writer at the same time! We really need to fix
> *that*! So please try the second debug patch I sent just now (with the BUG_ON()
> in put_online_cpus()). We need to know who is calling put_online_cpus() twice
> and fix that culprit!

I don't think so.

Lockdep is complaining, because

(a) during system bootup, the smp_init() -> cpu_up() -> cpuup_callback() 
    teaches him about hotplug.lock -> slab_mutex dependency

(b) many many jiffies later, nf_conntrack_cleanup_net() calls 
    kmem_cache_destroy(), which introduces slab_mutex -> hotplug.lock 
    dependency

Lockdep rightfully (from his POV) sees this as potential ABBA, and reports 
it, it's as simple as that.
It has no way of knowing the fact that the ABBA can actually never happen, 
because of special semantics of cpu_hotplug.refcount and it's handling in 
cpu_hotplug_begin().

The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin() 
until everyone who called get_online_cpus() will call put_online_cpus()" 
is totally invisible to lockdep.

> > So right, now I agree that the deadlock scenario I have come up with is 
> > indeed bogus (*), and we just have to annotate this fact to lockdep 
> > somehow.
> 
> Yes, the deadlock scenario is bogus, but the refcounting leak is for real
> and needs fixing.

With your patch applied, the BUG_ON() in put_online_cpus() didn't trigger 
for me at all. Which is what I expected.

> I'm fine with this, but the real problem is elsewhere, like I mentioned above.
> This one is only a good-to-have, not a fix.
>  
> > (*) I have seen machine locking hard reproducibly, but that was only with 
> > additional Paul's patch, so I guess the lock order there actually was 
> > wrong
> 
> If refcounting was working fine, Paul's patch wouldn't have caused *any* issues.
> With that patch in place, the 2 places where rcu_barrier() get invoked (ie.,
> kmem_cache_destroy() and deactivate_locked_super()) both start waiting on
> get_online_cpus() until the slab cpu hotplug notifier as well as the entire
> cpu_up operation completes. Absolutely no problem in that! So the fact that
> you are seeing lock-ups here is another indication that the problem is really
> elsewhere!

I don't agree. The reason why Paul's patch (1331e7a1bb) started to trigger 
this, is that (b) above doesn't exist in pre-1331e7a1bb kernels.

-- 
Jiri Kosina
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ