Message-ID: <506311CE.30406@hp.com>
Date: Wed, 26 Sep 2012 07:31:42 -0700
From: Don Morris <don.morris@...com>
To: linux-kernel@...r.kernel.org
CC: Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH 16/19] sched, numa: NUMA home-node selection code

Re-sending to LKML because my mailer picked up an incorrect
address. (Sorry for the dupe.)

On 09/26/2012 07:26 AM, Don Morris wrote:
> Peter --
>
> You have probably already seen this, and if so I apologize
> in advance (I can't find any sign of a fix in any searches).
>
> I picked up your August sched/numa patch set and have been
> working on it with a 2-node and an 8-node configuration. I got
> a very intermittent crash on the 2-node, which of course
> hasn't reproduced since I got crash/kdump configured.
> (I suspect it is related, however.)
>
> On the 8-node, however, I very reliably got a hard lockup
> NMI after several minutes. This occurs when running the first
> test of Andrea's autonuma-benchmark
> (git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git):
> two processes, one thread per core/vcore, each looping over a
> single malloc space. I'll attach the full stack set from that crash.
>
> Since the NMI output seemed really consistent that the hard
> lockup stemmed from waiting for a spinlock that never seemed
> to be picked up, I turned on Lock debugging in the .config and
> got a very clear, very consistent circular dependency warning (just
> below).
>
> As far as I can tell, the warning is correct and is consistent
> with the actual NMI crash output. The only variation there is that
> the "pidof" process on CPU 52 is going through task_sched_runtime()
> to do the task_rq_lock() operation on the numa01 process, so it
> gets the pi_lock and waits for the rq->lock, while numa01 (back
> on CPU 0) holds the rq->lock from scheduler_tick() and is going
> for the pi_lock via task_work_add().
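
The ordering problem described above can be reproduced in miniature in user space. The following is only a sketch, not kernel code: two Python threads stand in for the tick path (rq->lock, then pi_lock) and the task_sched_runtime() path (pi_lock, then rq->lock). The function name simulate_abba, the path labels, and the timeout are hypothetical, and a timed acquire is used so the demonstration reports the inversion instead of hanging.

```python
import threading


def simulate_abba(timeout=0.2):
    """Model two paths taking the same two locks in opposite order.

    Returns a dict mapping each path name to whether it obtained its
    second lock before the timeout expired.
    """
    rq_lock = threading.Lock()  # stands in for rq->lock
    pi_lock = threading.Lock()  # stands in for p->pi_lock
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            # Make sure both paths hold their first lock before continuing.
            both_hold_first.wait()
            ok = second.acquire(timeout=timeout)
            got_second[name] = ok
            if ok:
                second.release()
            # Keep the first lock held until the other path has also tried.
            both_tried_second.wait()

    paths = [
        threading.Thread(target=worker, args=("tick path", rq_lock, pi_lock)),
        threading.Thread(target=worker, args=("runtime path", pi_lock, rq_lock)),
    ]
    for t in paths:
        t.start()
    for t in paths:
        t.join()
    return got_second


if __name__ == "__main__":
    for name, ok in simulate_abba().items():
        print(name, "got its second lock" if ok else "timed out on its second lock")
```

With real spinlocks there is no timeout, so neither side would ever make progress, which would match the hard lockup NMI.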
>
> I'm nowhere near confident enough in my knowledge of the
> nuances of run queue locking during the tick update to try
> to hack a workaround - so sorry no proposed patch fix here,
> just a bug report.
>
> On another minor note: while looking over this, I noticed that
> most other CPUs were tied up waiting for the page lock on one of
> the huge pages (THP was of course on) while one of them busied
> itself invalidating across the other CPUs. The question comes to
> mind whether that is really needed.
>
> Yes, it certainly is needed in the true PROT_NONE case you're
> building off of, as you can't allow access to a translation which
> is now supposed to be locked out. But you could allow transitory
> minor faults when going from PROT_NONE back to access, as the
> fault would clear the TLB anyway. (At least on x86; any
> architecture which doesn't do that would have to have an explicit
> TLB invalidation for cases where the translation is detected as
> updated anyway, so that should be okay.)
>
> In your case, I would think the transitory faults on what's really
> a hint to the system would probably be much better than tying up
> N-1 other CPUs to do the flush on a process that spans the system,
> especially if the other processors are running that process but
> working on a different page (and hence may never even touch the
> page changing access anyway).
>
> Even in the case where you're adding the hint (access to NONE),
> you could be willing to miss an access in favor of letting the
> next context switch invalidate the TLB for you, given that you
> really need a non-trivial run time to merit doing this work and
> to have a good chance of settling out to a good access pattern.
> (Again, there may be architectures where you'll never invalidate
> unless it is done explicitly; I think IPF was that way, but it has
> been a while.)
>
> Just a thought.
>
> Thanks for your work,
> Don Morris
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.6.0-rc4 #28 Not tainted
> -------------------------------------------------------
> numa01/35386 is trying to acquire lock:
> (&p->pi_lock){-.-.-.}, at: [<ffffffff81073e68>] task_work_add+0x38/0xa0
>
> but task is already holding lock:
> (&rq->lock){-.-.-.}, at: [<ffffffff81085d83>] scheduler_tick+0x53/0x150
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&rq->lock){-.-.-.}:
> [<ffffffff810b52e3>] validate_chain+0x633/0x730
> [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
> [<ffffffff810b5959>] lock_acquire+0xe9/0x120
> [<ffffffff8152e306>] _raw_spin_lock+0x36/0x70
> [<ffffffff8108c1f1>] wake_up_new_task+0xd1/0x190
> [<ffffffff810513f2>] do_fork+0x1f2/0x280
> [<ffffffff8101bcd6>] kernel_thread+0x76/0x80
> [<ffffffff81513976>] rest_init+0x26/0xc0
> [<ffffffff81cdfeff>] start_kernel+0x3c6/0x3d3
> [<ffffffff81cdf356>] x86_64_start_reservations+0x131/0x136
> [<ffffffff81cdf45c>] x86_64_start_kernel+0x101/0x110
>
> -> #0 (&p->pi_lock){-.-.-.}:
> [<ffffffff810b48ef>] check_prev_add+0x11f/0x4e0
> [<ffffffff810b52e3>] validate_chain+0x633/0x730
> [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
> [<ffffffff810b5959>] lock_acquire+0xe9/0x120
> [<ffffffff8152e4b5>] _raw_spin_lock_irqsave+0x55/0xa0
> [<ffffffff81073e68>] task_work_add+0x38/0xa0
> [<ffffffff810905d7>] task_tick_numa+0xb7/0xd0
> [<ffffffff8109237a>] task_tick_fair+0x5a/0x70
> [<ffffffff81085e0e>] scheduler_tick+0xde/0x150
> [<ffffffff8106267e>] update_process_times+0x6e/0x90
> [<ffffffff810ad803>] tick_sched_timer+0xa3/0xe0
> [<ffffffff8107c266>] __run_hrtimer+0x106/0x1c0
> [<ffffffff8107c5f0>] hrtimer_interrupt+0x120/0x260
> [<ffffffff81538fdd>] smp_apic_timer_interrupt+0x8d/0xa3
> [<ffffffff81537eaf>] apic_timer_interrupt+0x6f/0x80
> [<ffffffff8152e326>] _raw_spin_lock+0x56/0x70
> [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
> [<ffffffff8114d1fc>] handle_pte_fault+0x9c/0x2a0
> [<ffffffff8114d5a0>] handle_mm_fault+0x1a0/0x1c0
> [<ffffffff81532de1>] do_page_fault+0x421/0x450
> [<ffffffff8152f2d5>] page_fault+0x25/0x30
>
> other info that might help us debug this:
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&rq->lock);
>                                lock(&p->pi_lock);
>                                lock(&rq->lock);
>   lock(&p->pi_lock);
>
> *** DEADLOCK ***
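
The check behind this report can be illustrated with a small lock-order graph. This is a simplified sketch of the idea only, not lockdep's actual implementation: every "B taken while holding A" observation adds an edge A -> B, and a new edge is flagged if the lock being taken can already reach the held lock. The class and method names are hypothetical.

```python
from collections import defaultdict


class LockOrderGraph:
    """Toy model of lock-order tracking (hypothetical, not lockdep itself)."""

    def __init__(self):
        # held lock -> set of locks that were acquired while holding it
        self.after = defaultdict(set)

    def _reachable(self, src, dst):
        """Return True if dst can be reached from src along recorded edges."""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.after.get(node, ()))
        return False

    def acquire(self, held, new):
        """Record 'new taken while holding held'.

        Returns True if this edge closes a cycle, i.e. 'held' already
        depends on 'new' through some earlier acquisition order.
        """
        closes_cycle = self._reachable(new, held)
        self.after[held].add(new)
        return closes_cycle


if __name__ == "__main__":
    graph = LockOrderGraph()
    # Dependency #1 above: rq->lock taken while holding p->pi_lock.
    print(graph.acquire("p->pi_lock", "rq->lock"))
    # Dependency #0 above: p->pi_lock taken while holding rq->lock.
    print(graph.acquire("rq->lock", "p->pi_lock"))
```

The second call is the one that corresponds to task_work_add() being reached from scheduler_tick(): it is flagged because the opposite order was recorded earlier, whether or not the two paths ever actually race.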
>
> 3 locks held by numa01/35386:
> #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81532bbc>] do_page_fault+0x1fc/0x450
> #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
> #2: (&rq->lock){-.-.-.}, at: [<ffffffff81085d83>] scheduler_tick+0x53/0x150
>
> stack backtrace:
> Pid: 35386, comm: numa01 Not tainted 3.6.0-rc4 #28
> Call Trace:
> <IRQ> [<ffffffff810b36a7>] print_circular_bug+0xf7/0x120
> [<ffffffff8108f5d7>] ? update_sd_lb_stats+0x347/0x700
> [<ffffffff810b48ef>] check_prev_add+0x11f/0x4e0
> [<ffffffff8101afe5>] ? native_sched_clock+0x35/0x80
> [<ffffffff8101a5d9>] ? sched_clock+0x9/0x10
> [<ffffffff8108d82f>] ? sched_clock_cpu+0x4f/0x110
> [<ffffffff810b52e3>] validate_chain+0x633/0x730
> [<ffffffff8101a5d9>] ? sched_clock+0x9/0x10
> [<ffffffff810b57d2>] __lock_acquire+0x3f2/0x490
> [<ffffffff810afc5d>] ? trace_hardirqs_off+0xd/0x10
> [<ffffffff810b5959>] lock_acquire+0xe9/0x120
> [<ffffffff81073e68>] ? task_work_add+0x38/0xa0
> [<ffffffff8152e4b5>] _raw_spin_lock_irqsave+0x55/0xa0
> [<ffffffff81073e68>] ? task_work_add+0x38/0xa0
> [<ffffffff81073e68>] task_work_add+0x38/0xa0
> [<ffffffff810905d7>] task_tick_numa+0xb7/0xd0
> [<ffffffff8109237a>] task_tick_fair+0x5a/0x70
> [<ffffffff81085e0e>] scheduler_tick+0xde/0x150
> [<ffffffff8106267e>] update_process_times+0x6e/0x90
> [<ffffffff810ad803>] tick_sched_timer+0xa3/0xe0
> [<ffffffff8107c266>] __run_hrtimer+0x106/0x1c0
> [<ffffffff810ad760>] ? tick_nohz_restart+0xa0/0xa0
> [<ffffffff8107c5f0>] hrtimer_interrupt+0x120/0x260
> [<ffffffff81538fdd>] smp_apic_timer_interrupt+0x8d/0xa3
> [<ffffffff81537eaf>] apic_timer_interrupt+0x6f/0x80
> <EOI> [<ffffffff8108d93b>] ? local_clock+0x4b/0x70
> [<ffffffff812754e2>] ? do_raw_spin_lock+0xb2/0x140
> [<ffffffff81275509>] ? do_raw_spin_lock+0xd9/0x140
> [<ffffffff8152e326>] _raw_spin_lock+0x56/0x70
> [<ffffffff811488e8>] ? do_anonymous_page+0x1e8/0x270
> [<ffffffff811488e8>] do_anonymous_page+0x1e8/0x270
> [<ffffffff8114d1fc>] handle_pte_fault+0x9c/0x2a0
> [<ffffffff81532bbc>] ? do_page_fault+0x1fc/0x450
> [<ffffffff810b5ddf>] ? __lock_release+0x14f/0x180
> [<ffffffff8114d5a0>] handle_mm_fault+0x1a0/0x1c0
> [<ffffffff8107d1c5>] ? down_read_trylock+0x55/0x70
> [<ffffffff81532de1>] do_page_fault+0x421/0x450
> [<ffffffff810b5ddf>] ? __lock_release+0x14f/0x180
> [<ffffffff810b4522>] ? trace_hardirqs_on_caller+0x152/0x1c0
> [<ffffffff810b459d>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff8152ed60>] ? _raw_spin_unlock_irq+0x30/0x40
> [<ffffffff8152d670>] ? __schedule+0x610/0x690
> [<ffffffff8126f03d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [<ffffffff8152f2d5>] page_fault+0x25/0x30
>