[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20151103125504.6649138f@tlielax.poochiereds.net>
Date: Tue, 3 Nov 2015 12:55:04 -0500
From: Jeff Layton <jlayton@...chiereds.net>
To: Chris Worley <chris.worley@...marydata.com>
Cc: Tejun Heo <tj@...nel.org>, linux-kernel@...r.kernel.org,
bfields@...ldses.org,
Michael Skralivetsky <michael.skralivetsky@...marydata.com>,
Trond Myklebust <trond.myklebust@...marydata.com>,
Shaohua Li <shli@...com>
Subject: Re: timer code oops when calling mod_delayed_work
On Mon, 2 Nov 2015 20:33:39 -0500
Jeff Layton <jlayton@...chiereds.net> wrote:
> On Mon, 2 Nov 2015 14:56:33 -0500
> Jeff Layton <jlayton@...chiereds.net> wrote:
>
> > On Mon, 2 Nov 2015 09:48:41 -1000
> > Chris Worley <chris.worley@...marydata.com> wrote:
> >
> > > On Sat, Oct 31, 2015 at 11:54 AM, Jeff Layton <jlayton@...chiereds.net> wrote:
> > > > On Sat, 31 Oct 2015 17:31:07 -0400
> > > > Tejun Heo <tj@...nel.org> wrote:
> > > ...
> > > >>
> > > >> > I have asked Chris and Michael to see if they can bisect it down, but
> > > >> > it may be a bit before they can get that done. Any insight you might
> > > >> > have in the meantime would helpful.
> > > >>
> > > >> Yeah, I'd love to find out how reproducible the issue is. If the
> > > >> problem is rarely reproducible, it might make sense to try
> > > >> instrumentation before trying bisection as it *could* be a latent bug
> > > >> which has been there all along and bisecting to the commit introducing
> > > >> the code wouldn't help us too much.
> > > >>
> > > >
> > > > It seems fairly reproducible, at least on v4.3-rc7 kernels:
> > > >
> > > > This came about when I asked them to perf test some nfsd patches that I
> > > > have queued up. I patched a Fedora 4.3-rc7 kernel and wanted to see
> > > > what the perf delta was (with NFSv3, fwiw):
> > > >
> > > > Patched kernels here: http://koji.fedoraproject.org/koji/taskinfo?taskID=11598089
> > > >
> > > > Unpatched kernels here: http://koji.fedoraproject.org/koji/buildinfo?buildID=694377
> > > >
> > > > Michael was using the SPEC SFS VDI workload to test, and was able to
> > > > get the same panic on both kernels. So it does seem to be reproducible.
> > > > It might even be possible to tune the VM to make the shrinker fire more
> > > > often, which may help tickle this more.
> > > >
> > > > In any case, I've asked them to try something v4.2-ish and see if it's
> > > > reproducible there, and then try v4.1 if it is. I figure anything
> > > > earlier is probably not worth testing if it still fails on v4.1. If it
> > > > turns out not to be reproducible on those earlier kernels then we can
> > > > bisect from there to track it down.
> > >
> > > The trick seems to be the NFS thread count: I initially though this
> > > was SFS/VDI specific, but when I ratcheted up the thread count to what
> > > Michael was using, 256 threads oopses on fio (throughput) benchmarks
> > > too.
> > >
> > > In bisecting kernels, it appeared between 4.2.3-200 and 4.2.5-200 (all
> > > the 4.2.4 kernels were bad).
> > >
> > > Jeff has a lead on this...
> > >
> > > Chris
> >
> > Thanks Chris. This commit went in between those kernels:
> >
> > commit a7c571f2e3ff9243ba41c242486f53dbca37d864
> > Author: Shaohua Li <shli@...com>
> > Date: Wed Sep 30 09:05:30 2015 -0700
> >
> > workqueue: make sure delayed work run in local cpu
> >
> > commit 874bbfe600a660cba9c776b3957b1ce393151b76 upstream.
> >
> >
> > ...and I suspect that it's the culprit. That commit causes this code to
> > always use add_timer_on, which seems to have different semantics from
> > add_timer.
> >
> > I'm going to build a v4.2.5 kernel with that patch reverted to confirm
> > it, but it seems likely...
> >
>
> (dropping Lai's address, and adding Shaohua Li's)
>
> Ok, I built a kernel with that patch reverted and that seems to fix the
> problem.
>
> Looking at the patch, I guess the main difference is that we're no
> longer using add_timer for unbound workqueue tasks. That said, we
> should have possession of the PENDING bit before calling add_timer_on.
>
> So...I'm a little stumped. Any thoughts as to where the race could be?
Ok, I've been able to reproduce this with a small module that I wrote
as a reproducer:
[ 197.502139] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 197.503073] IP: [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.503752] PGD 0
[ 197.504061] Oops: 0002 [#1] SMP
[ 197.504449] Modules linked in: wqthrash(OE-) snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core xfs snd_hwdep snd_seq libcrc32c snd_seq_device ppdev snd_pcm snd_timer snd soundcore parport_pc e1000 i2c_piix4 parport joydev virtio_balloon pvpanic acpi_cpufreq nfsd nfs_acl lockd auth_rpcgss grace sunrpc virtio_blk virtio_console qxl drm_kms_helper ttm virtio_pci virtio_ring serio_raw virtio drm ata_generic pata_acpi
[ 197.509358] CPU: 1 PID: 794 Comm: kworker/u8:9 Tainted: G OE 4.3.0-0.rc7.git2.1.fc24.x86_64 #1
[ 197.510767] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 197.511850] Workqueue: wqthrash wqthrash_workfunc [wqthrash]
[ 197.513141] task: ffff88020f918000 ti: ffff8802143a8000 task.ti: ffff8802143a8000
[ 197.513958] RIP: 0010:[<ffffffff8112d67c>] [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.514927] RSP: 0018:ffff8802143abc68 EFLAGS: 00010086
[ 197.515539] RAX: dead000000000200 RBX: ffffffffa022e050 RCX: 000000000000000f
[ 197.516552] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[ 197.517374] RBP: ffff8802143abc88 R08: 0000000000000001 R09: 0000000000000001
[ 197.518194] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 197.518992] R13: ffff880216fce2c0 R14: 0000000000000001 R15: 00000000000003e8
[ 197.519817] FS: 0000000000000000(0000) GS:ffff880217000000(0000) knlGS:0000000000000000
[ 197.520738] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 197.521410] CR2: 0000000000000000 CR3: 00000000daa9b000 CR4: 00000000000006e0
[ 197.522234] Stack:
[ 197.522584] ffffffffa022e050 0000000000000000 ffff880216fce2c0 ffff88021688f800
[ 197.524102] ffff8802143abcc0 ffffffff8112db72 0000000000000086 00000000ba809f75
[ 197.525594] ffffffffa022e000 ffff8802143abd00 0000000000000001 ffff8802143abcf0
[ 197.527082] Call Trace:
[ 197.527581] [<ffffffff8112db72>] del_timer+0x62/0xa0
[ 197.532565] [<ffffffff810c913e>] try_to_grab_pending+0xce/0x190
[ 197.533253] [<ffffffff810c9242>] mod_delayed_work_on+0x42/0xb0
[ 197.533965] [<ffffffffa022c081>] wqthrash_workfunc+0x61/0x90 [wqthrash]
[ 197.534766] [<ffffffff810c8470>] process_one_work+0x230/0x6a0
[ 197.535683] [<ffffffff810c83d9>] ? process_one_work+0x199/0x6a0
[ 197.536378] [<ffffffff810c892e>] worker_thread+0x4e/0x450
[ 197.537004] [<ffffffff810c88e0>] ? process_one_work+0x6a0/0x6a0
[ 197.537724] [<ffffffff810cf981>] kthread+0x101/0x120
[ 197.538662] [<ffffffff81104fa9>] ? trace_hardirqs_on_caller+0x129/0x1b0
[ 197.539926] [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[ 197.541160] [<ffffffff81864fef>] ret_from_fork+0x3f/0x70
[ 197.542187] [<ffffffff810cf880>] ? kthread_create_on_node+0x250/0x250
[ 197.543416] Code: 3b 89 e4 00 e8 06 64 ff ff 85 c0 74 0d 80 3d 45 61 e4 00 00 0f 84 c4 00 00 00 65 ff 0d de e7 ed 7e 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 45 84 f6 74 08 48 c7 43 08 00 00 00
[ 197.547527] RIP [<ffffffff8112d67c>] detach_if_pending+0x6c/0x190
[ 197.548481] RSP <ffff8802143abc68>
[ 197.549220] CR2: 0000000000000000
[ 197.549883] ---[ end trace 10dec6677e9e3a68 ]---
Apply this patch, build a wqthrash.ko module, plug it in and watch it
explode. It basically just spawns a bunch of (immediate) workqueue jobs
that attempt to schedule and modify a single piece of delayed_work.
I still don't have a lead on the actual bug, but this does seem to
tickle it fairly easily.
-----------------------8<------------------------------
[PATCH] wqthrash: add module that will thrash the workqueue code
...let's see if we can tickle the races in the delayed_work
infrastructure.
Signed-off-by: Jeff Layton <jeff.layton@...marydata.com>
---
samples/Makefile | 3 +-
samples/wqthrash/Makefile | 1 +
samples/wqthrash/wqthrash.c | 94 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 97 insertions(+), 1 deletion(-)
create mode 100644 samples/wqthrash/Makefile
create mode 100644 samples/wqthrash/wqthrash.c
diff --git a/samples/Makefile b/samples/Makefile
index f00257bcc5a7..7e92dc91879d 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,5 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ trace_events/ livepatch/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ seccomp/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ rpmsg/ \
+ seccomp/ wqthrash/
diff --git a/samples/wqthrash/Makefile b/samples/wqthrash/Makefile
new file mode 100644
index 000000000000..da4c36a2c7d5
--- /dev/null
+++ b/samples/wqthrash/Makefile
@@ -0,0 +1 @@
+obj-m += wqthrash.o
diff --git a/samples/wqthrash/wqthrash.c b/samples/wqthrash/wqthrash.c
new file mode 100644
index 000000000000..b1c52c20fb8f
--- /dev/null
+++ b/samples/wqthrash/wqthrash.c
@@ -0,0 +1,94 @@
+/*
+ * wqthrash.c -- thrash the delayed workqueue rescheduling code
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/workqueue.h>
+#include <linux/delay.h>
+#include <linux/sched.h>
+
+#define ITERATIONS (102400)
+#define DELAY (HZ)
+#define THRASHERS (256)
+
+static void
+dummy_work(struct work_struct *unused)
+{
+ schedule_timeout_uninterruptible(1);
+}
+
+static DECLARE_DELAYED_WORK(wqthrash_delayed_work, dummy_work);
+
+static void
+wqthrash_workfunc(struct work_struct *unused)
+{
+ int i;
+
+ for (i = 0; i < ITERATIONS; ++i) {
+ schedule_delayed_work(&wqthrash_delayed_work, DELAY);
+ cond_resched();
+ mod_delayed_work(system_wq, &wqthrash_delayed_work, DELAY);
+ cond_resched();
+ }
+}
+
+static struct workqueue_struct *wq;
+static struct work_struct *tw;
+
+static int
+wqthrash_init(void)
+{
+ int i;
+
+ wq = alloc_workqueue("wqthrash", WQ_UNBOUND, 0);
+ if (!wq)
+ return -ENOMEM;
+
+ tw = kcalloc(THRASHERS, sizeof(*tw), GFP_KERNEL);
+ if (!tw) {
+ destroy_workqueue(wq);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < THRASHERS; ++i) {
+ INIT_WORK(&tw[i], wqthrash_workfunc);
+ queue_work(wq, &tw[i]);
+ }
+ return 0;
+}
+
+static void
+wqthrash_exit(void)
+{
+ int i;
+
+ for (i = 0; i < THRASHERS; ++i)
+ flush_work(&tw[i]);
+
+ kfree(tw);
+ destroy_workqueue(wq);
+ cancel_delayed_work_sync(&wqthrash_delayed_work);
+}
+
+module_init(wqthrash_init);
+module_exit(wqthrash_exit);
+MODULE_LICENSE("GPL");
--
2.4.3
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists