linux-kernel - Re: mmotm 2009-04-10-02-21 uploaded

Message-ID: <alpine.LFD.2.00.0904130847500.4583@localhost.localdomain>
Date:	Mon, 13 Apr 2009 09:04:23 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Valdis.Kletnieks@...edu, Mike Travis <travis@....com>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	mm-commits@...r.kernel.org, Rusty Russell <rusty@...tcorp.com.au>,
	Dave Jones <davej@...hat.com>, Ingo Molnar <mingo@...e.hu>
Subject: Re: mmotm 2009-04-10-02-21 uploaded - forkbombed by work_for_cpu

On Sat, 11 Apr 2009, Valdis.Kletnieks@...edu wrote:
> 
> Probable cause for my problem:
> 
> arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.c calls work_on_cpu(). We get into a
> state where we have enough activity to kick us to a high CPU speed, and then
> the activity of writing 90 acct records per sec keeps us there - with continual
> callbacks to see if we can drop the CPU speed.

Ok, I think that that work_on_cpu() commit is broken, but I _also_ think 
that cpufreq is doing something fairly insane.

This behavior seems to be triggered by the "ondemand" policy case, btw, 
and it literally does basically:

  dbs_check_cpu:
    for_each_cpu(j, policy->cpus)
      ...
      freq_avg = __cpufreq_driver_getavg(policy, j);

where "__cpufreq_driver_getavg()" will do "freq->getavg(policy, cpu)" and 
then acpi-cpufreq.c will do that "work_on_cpu()" as part of the call to 
"get_measured_perf()".

So pretty much _all_ use is going to always effectively do a broadcast 
"work on each cpu" thing. That's always going to be pretty damn expensive.

And there's no _reason_. As far as I can tell, that ACPI cpufreq thing 
doesn't _need_ any "process context".  That "get_measured_perf()" will 
just do a single read_measured_perf_ctrs() call, and all that does is two 
'rdmsr()' calls.

So afaik, acpi-cpufreq.c should not use "work_on_cpu()" for that at all. 
It should just do a smp_call_function_single(). 

So I do think Andrew's commit is broken and we should think about it a bit 
more, but I also think that Valdis' problem comes from acpi-cpufreq just 
being damn stupid. Doing a smp_call_function_single() to read two MSRs is 
going to be a _lot_ more efficient than doing that crazy work_on_cpu() for 
that.
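
[Illustration, not part of the original patch discussion.] A minimal userspace
sketch of the pattern suggested above: a small callback that reads the two
counters, handed to a "run this on that CPU" helper together with a pointer to
the result. The callback shape, void (*func)(void *info), and the argument
order (cpu, func, info, wait) follow the kernel's smp_call_function_single().
Everything prefixed mock_ or fake_, the struct perf_pair layout, and
read_perf_on_cpu() are hypothetical stand-ins so the example compiles outside
the kernel. The mock simply runs the callback locally while pretending to be
the target CPU; it does not send an IPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSR indices of the x86 APERF/MPERF counters. */
#define MSR_IA32_MPERF 0xe7
#define MSR_IA32_APERF 0xe8

#define NR_FAKE_CPUS 4	/* assumption: size of the simulated machine */

/* Hypothetical result pair filled in by the callback. */
struct perf_pair {
	uint64_t aperf;
	uint64_t mperf;
};

/* Simulated per-CPU counter values (stand-in for real MSR contents). */
uint64_t fake_aperf[NR_FAKE_CPUS];
uint64_t fake_mperf[NR_FAKE_CPUS];

/* Which simulated CPU the current code is "running on". */
static int fake_current_cpu;

/* Hypothetical stand-in for rdmsr(): reads the counter of the current CPU. */
static uint64_t mock_rdmsr64(unsigned int msr)
{
	if (msr == MSR_IA32_APERF)
		return fake_aperf[fake_current_cpu];
	if (msr == MSR_IA32_MPERF)
		return fake_mperf[fake_current_cpu];
	return 0;
}

/*
 * Userspace stand-in for smp_call_function_single(cpu, func, info, wait).
 * It runs func(info) synchronously as if on 'cpu' and returns 0, or a
 * nonzero value if the CPU does not exist. The 'wait' flag is ignored here
 * because the mock is always synchronous.
 */
static int mock_smp_call_function_single(int cpu, void (*func)(void *info),
					 void *info, int wait)
{
	int saved = fake_current_cpu;

	(void)wait;
	assert(func != NULL);
	if (cpu < 0 || cpu >= NR_FAKE_CPUS)
		return -1;

	fake_current_cpu = cpu;	/* pretend we are now on the target CPU */
	func(info);
	fake_current_cpu = saved;
	return 0;
}

/* The callback: just two MSR reads, so it needs no process context. */
static void read_measured_perf_ctrs(void *info)
{
	struct perf_pair *cur = info;

	cur->aperf = mock_rdmsr64(MSR_IA32_APERF);
	cur->mperf = mock_rdmsr64(MSR_IA32_MPERF);
}

/* Hypothetical caller: fetch both counters from 'cpu' in one cross call. */
int read_perf_on_cpu(int cpu, struct perf_pair *out)
{
	return mock_smp_call_function_single(cpu, read_measured_perf_ctrs,
					     out, 1);
}
```

The point of the shape is that the callback only reads two registers and never
sleeps, so it can run directly on the target CPU, and no worker thread has to
be created or scheduled for each sample.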

So the _real_ problem came through the commits like

    cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
    cpumask: use work_on_cpu in acpi-cpufreq.c for read_measured_perf_ctrs

that were meant to reduce stack usage with big cpu masks. And sure, the 
_old_ way of doing it was also stupid (it rescheduled the process to the 
other CPU by using set_cpus_allowed()).

Mike, Ingo?

		Linus