Date:   Fri, 04 Dec 2020 16:19:52 -0500
From:   Qian Cai <qcai@...hat.com>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     Peter Zijlstra <peterz@...radead.org>, tglx@...utronix.de,
        mingo@...nel.org, linux-kernel@...r.kernel.org,
        bigeasy@...utronix.de, qais.yousef@....com, swood@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vincent.donnefort@....com,
        tj@...nel.org, ouwen210@...mail.com
Subject: Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU
 hotplug cooperative

On Tue, 2020-11-17 at 19:28 +0000, Valentin Schneider wrote:
> We did have some breakage in that area, but all the holes I was aware of
> have been plugged. What would help here is to see which tasks are still
> queued on that outgoing CPU, and their recent activity.
> 
> Something like
> - ftrace_dump_on_oops on your kernel cmdline
> - trace-cmd start -e 'sched:*'
>  <start the test here>
> 
> ought to do it. Then you can paste the (tail of the) ftrace dump.
> 
> I also had this laying around, which may or may not be of some help:

Okay, your patch did not help; the problem can still be reproduced with this test:

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug04.sh

# while :; do cpuhotplug04.sh -l 1; done

The ftrace dump produces too much output on this 256-CPU system, and I did not
have the patience to wait for it to finish after 15 minutes. But here is the log
captured so far (search for "kernel BUG" there).

http://people.redhat.com/qcai/console.log
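
For anyone following along, here is a minimal sketch for pulling the relevant
part out of that log. It assumes the log has been saved locally as console.log;
bug_context is a hypothetical helper name and the context sizes are arbitrary.

```shell
# Hypothetical helper: print each "kernel BUG" line with some surrounding
# context (5 lines before, 40 after), prefixed with line numbers.
# Assumes the log was saved locally; defaults to ./console.log.
bug_context() {
    grep -n -B 5 -A 40 'kernel BUG' "${1:-console.log}"
}
```

Calling bug_context with no argument reads ./console.log; pass a path to read a
different file.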

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a6aaf9fb3400..c4a4cb8b47a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7534,7 +7534,25 @@ int sched_cpu_dying(unsigned int cpu)
>  	sched_tick_stop(cpu);
>  
>  	rq_lock_irqsave(rq, &rf);
> -	BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> +
> +	if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
> +		struct task_struct *g, *p;
> +
> +		pr_crit("CPU%d nr_running=%d\n", cpu, rq->nr_running);
> +		rcu_read_lock();
> +		for_each_process_thread(g, p) {
> +			if (task_cpu(p) != cpu)
> +				continue;
> +
> +			if (!task_on_rq_queued(p))
> +				continue;
> +
> +			pr_crit("\tp=%s\n", p->comm);
> +		}
> +		rcu_read_unlock();
> +		BUG();
> +	}
> +
>  	rq_unlock_irqrestore(rq, &rf);
>  
>  	calc_load_migrate(rq);
> 
