[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAErSpo65Q0+UT3sS6LcyXohkCMfjweoS_j+okiYq4bdjFKjdHw@mail.gmail.com>
Date: Thu, 18 Apr 2013 15:40:01 -0600
From: Bjorn Helgaas <bhelgaas@...gle.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: Or Gerlitz <ogerlitz@...lanox.com>,
Ming Lei <ming.lei@...onical.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
David Miller <davem@...emloft.net>,
Roland Dreier <roland@...nel.org>,
netdev <netdev@...r.kernel.org>, Yan Burman <yanb@...lanox.com>,
Jack Morgenstein <jackm@....mellanox.co.il>,
Tejun Heo <tj@...nel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
Subject: Re: [PATCHv2 for-3.9] pci: avoid work_on_cpu for nested SRIOV probes
On Thu, Apr 18, 2013 at 2:08 PM, Michael S. Tsirkin <mst@...hat.com> wrote:
> The following lockdep report triggers since 3.9-rc1:
>
> =============================================
> [ INFO: possible recursive locking detected ]
> 3.9.0-rc1 #96 Not tainted
> ---------------------------------------------
> kworker/0:1/734 is trying to acquire lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81066cb0>] flush_work+0x0/0x250
>
> but task is already holding lock:
> ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock((&wfc.work));
> lock((&wfc.work));
>
> *** DEADLOCK ***
>
> May be due to missing lock nesting notation
>
> 3 locks held by kworker/0:1/734:
> #0: (events){.+.+.+}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
> #1: ((&wfc.work)){+.+.+.}, at: [<ffffffff81064352>]
> process_one_work+0x162/0x4c0
> #2: (&__lockdep_no_validate__){......}, at: [<ffffffff812db225>]
> device_attach+0x25/0xb0
>
> stack backtrace:
> Pid: 734, comm: kworker/0:1 Not tainted 3.9.0-rc1 #96
> Call Trace:
> [<ffffffff810948ec>] validate_chain+0xdcc/0x11f0
> [<ffffffff81095150>] __lock_acquire+0x440/0xc70
> [<ffffffff81095150>] ? __lock_acquire+0x440/0xc70
> [<ffffffff810959da>] lock_acquire+0x5a/0x70
> [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
> [<ffffffff81066cf5>] flush_work+0x45/0x250
> [<ffffffff81066cb0>] ? wq_worker_waking_up+0x60/0x60
> [<ffffffff810922be>] ? mark_held_locks+0x9e/0x130
> [<ffffffff81066a96>] ? queue_work_on+0x46/0x90
> [<ffffffff810925dd>] ? trace_hardirqs_on_caller+0xfd/0x190
> [<ffffffff8109267d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff81066f74>] work_on_cpu+0x74/0x90
> [<ffffffff81063820>] ? keventd_up+0x20/0x20
> [<ffffffff8121fd30>] ? pci_pm_prepare+0x60/0x60
> [<ffffffff811f9293>] ? cpumask_next_and+0x23/0x40
> [<ffffffff81220a1a>] pci_device_probe+0xba/0x110
> [<ffffffff812dadca>] ? driver_sysfs_add+0x7a/0xb0
> [<ffffffff812daf1f>] driver_probe_device+0x8f/0x230
> [<ffffffff812db170>] ? __driver_attach+0xb0/0xb0
> [<ffffffff812db1bb>] __device_attach+0x4b/0x60
> [<ffffffff812d9314>] bus_for_each_drv+0x64/0x90
> [<ffffffff812db298>] device_attach+0x98/0xb0
> [<ffffffff81218474>] pci_bus_add_device+0x24/0x50
> [<ffffffff81232e80>] virtfn_add+0x240/0x3e0
> [<ffffffff8146ce3d>] ? _raw_spin_unlock_irqrestore+0x3d/0x80
> [<ffffffff812333be>] pci_enable_sriov+0x23e/0x500
> [<ffffffffa011fa1a>] __mlx4_init_one+0x5da/0xce0 [mlx4_core]
> [<ffffffffa012016d>] mlx4_init_one+0x2d/0x60 [mlx4_core]
> [<ffffffff8121fd79>] local_pci_probe+0x49/0x80
> [<ffffffff81063833>] work_for_cpu_fn+0x13/0x20
> [<ffffffff810643b8>] process_one_work+0x1c8/0x4c0
> [<ffffffff81064352>] ? process_one_work+0x162/0x4c0
> [<ffffffff81064cfb>] worker_thread+0x30b/0x430
> [<ffffffff810649f0>] ? manage_workers+0x340/0x340
> [<ffffffff8106cea6>] kthread+0xd6/0xe0
> [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70
> [<ffffffff8146daac>] ret_from_fork+0x7c/0xb0
> [<ffffffff8106cdd0>] ? __init_kthread_worker+0x70/0x70
>
> The issue is that a driver, in it's probe function, calls
> pci_sriov_enable so a PF device probe causes VF probe (AKA nested
> probe). Each probe in pci_device_probe which is (normally) run through
> work_on_cpu (this is to get the right numa node for memory allocated by
> the driver). In turn work_on_cpu does this internally:
>
> schedule_work_on(cpu, &wfc.work);
> flush_work(&wfc.work);
>
> So if you are running probe on CPU1, and cause another
> probe on the same CPU, this will try to flush
> workqueue from inside same workqueue which causes
> a lockep warning.
>
> Nested probing might be tricky to get right generally.
>
> But for pci_sriov_enable, the situation is actually very simple: all VFs
> naturally have same affinity as the PF, and cpumask_any_and is actually
> same as cpumask_first_and, so it always gives us the same CPU.
> So let's just detect that, and run the probing for VFs locally without a
> workqueue.
>
> This is hardly elegant, but looks to me like an appropriate quick fix
> for 3.9.
>
> Tested-by: Or Gerlitz <ogerlitz@...lanox.com>
> Signed-off-by: Michael S. Tsirkin <mst@...hat.com>
> Acked-by: Tejun Heo <tj@...nel.org>
Thanks, Michael. I put this in my for-linus branch:
http://git.kernel.org/cgit/linux/kernel/git/helgaas/pci.git/log/?h=for-linus
I'll send a pull request to Linus today.
Bjorn
> ---
>
> Changes from v1:
> - clarified commit log and added Ack by Tejun Heo
> patch is unchanged.
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 1fa1e48..6eeb5ec 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -286,8 +286,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> int cpu;
>
> get_online_cpus();
> cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
> - if (cpu < nr_cpu_ids)
> + if (cpu != raw_smp_processor_id() && cpu < nr_cpu_ids)
> error = work_on_cpu(cpu, local_pci_probe, &ddi);
> else
> error = local_pci_probe(&ddi);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists