Date:	Tue, 20 Jul 2010 09:55:29 -0700
From:	Venkatesh Pallipadi <venki@...gle.com>
To:	Martin Schwidefsky <schwidefsky@...ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...e.hu>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Balbir Singh <balbir@...ux.vnet.ibm.com>,
	Paul Menage <menage@...gle.com>, linux-kernel@...r.kernel.org,
	Paul Turner <pjt@...gle.com>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Paul Mackerras <paulus@...ba.org>,
	Tony Luck <tony.luck@...el.com>
Subject: Re: [PATCH 0/4] Finer granularity and task/cgroup irq time accounting

On Tue, Jul 20, 2010 at 12:55 AM, Martin Schwidefsky
<schwidefsky@...ibm.com> wrote:
> On Mon, 19 Jul 2010 16:57:11 -0700
> Venkatesh Pallipadi <venki@...gle.com> wrote:
>
>> Currently, softirq and hardirq time reporting is only done at the
>> CPU level. There are use cases where reporting this time against a task,
>> task group, or cgroup will be useful to the user/administrator
>> for resource planning and utilization charging. Also, as the
>> accounting is already done at the CPU level, reporting the same at
>> the task level does not add any significant computational overhead
>> other than task-level storage (patch 1).
>
> I never understood why the softirq and hardirq time gets accounted to a
> task at all. Why is it that the poor task that is running gets charged
> with the cpu time of an interrupt that has nothing to do with the task?
> I consider this to be a bug, and now this gets formalized in the
> taskstats interface? Imho not a good idea.

Agreed that this is a bug. I started by looking at resolving it, but
it was not exactly easy. Ideally we want irq time to be charged to the
right task as much as possible. With something like the network rcv
softirq, for example, there is a task that is eventually going to
consume the packet, and that task should be charged. If we can't find a
suitable match, we may have to charge the time to some system thread.
Things like threaded interrupts will mitigate this problem a bit. But
until we have a good enough solution, this bug will stay with us.
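To make the charging policy concrete, here is a rough user-space model
of it (not the patch itself; the struct, function, and task names below
are all made up for illustration):

/* Illustrative user-space model of irq time charging policies.
 * Not kernel code; fields and names are invented for this sketch. */
#include <stdio.h>

struct task {
	const char *name;
	unsigned long long irq_ns;	/* irq time charged to this task */
};

/* Charge an irq interval: prefer the task that will consume the work
 * (e.g. the socket owner for a network rcv softirq); otherwise fall
 * back to a system thread instead of whoever happened to be running. */
static void charge_irq(struct task *consumer, struct task *system_thread,
		       unsigned long long delta_ns)
{
	struct task *victim = consumer ? consumer : system_thread;
	victim->irq_ns += delta_ns;
}

int main(void)
{
	struct task web  = { "webserver", 0 };
	struct task ksys = { "ksoftirqd/0", 0 };

	charge_irq(&web, &ksys, 42000);	/* consumer identified */
	charge_irq(NULL, &ksys, 17000);	/* no match: charge system thread */

	printf("%s: %llu ns, %s: %llu ns\n",
	       web.name, web.irq_ns, ksys.name, ksys.irq_ns);
	return 0;
}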

This change takes a small step by giving a hint about this to the
user/administrator, who can take corrective action based on it. The
next step is to give the CFS scheduler some information about irq time,
and I am working on a patch for that. That will help load-balancing
decisions, with an irq-heavy CPU not claiming the same weightage as
the other CPUs. I don't think these interfaces are binding in any way.
If and when we stop charging tasks for irq time, we can simply report
"0" in these interfaces (there is some precedent for this in the
/proc/<pid>/stat output already).
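For reference, this is roughly how a tool could pull the existing
utime/stime fields out of /proc/<pid>/stat today; per-task si/hi fields
from this series would presumably be read the same way (where exactly
they land in the line is my assumption, not taken from the patch):

/* Sketch: read utime/stime (USER_HZ ticks) from /proc/<pid>/stat. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char path[64], buf[4096];
	unsigned long utime, stime;
	FILE *f;
	char *p;

	snprintf(path, sizeof(path), "/proc/%d/stat", getpid());
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f))
		return 1;
	fclose(f);

	/* comm (field 2) may contain spaces; skip past its closing ')' */
	p = strrchr(buf, ')');
	if (!p)
		return 1;
	/* after ')': state is field 3; utime/stime are fields 14/15 */
	if (sscanf(p + 2,
		   "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
		   &utime, &stime) != 2)
		return 1;

	printf("utime=%lu stime=%lu (USER_HZ ticks)\n", utime, stime);
	return 0;
}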

>> The softirq/hardirq statistics are commonly done based on tick-based
>> sampling, though some archs have CONFIG_VIRT_CPU_ACCOUNTING based
>> fine-granularity accounting. Having a similar mechanism to get fine
>> granularity accounting on x86 will be a major challenge, given the
>> state of TSC reliability on various platforms and also the overhead
>> it may add to common paths like syscall entry/exit.
>>
>> An alternative is to have a generic (sched_clock based) and configurable
>> fine-granularity accounting of si and hi time which can be reported
>> over the /proc/<pid>/stat API (patch 2).
>
> To get fine-granular accounting for interrupts you need to do a
> sched_clock call on irq entry and another one on irq exit. Isn't that
> too expensive on an x86 system? (I do think this is a good idea, but
> there is still the worry about the overhead.)

On x86: yes, overhead is a potential problem. That's the reason I put
this inside a CONFIG option. But I have tested this with a few
workloads on different systems released in the past two years, and I
did not see any measurable overhead. Note that this is used only when
sched_clock is based on the TSC, and not when it is based on jiffies.
The sched_clock overhead I measured on different platforms was in the
30-150 cycle range, which probably isn't going to be highly visible in
generic workloads.

Archs like s390/powerpc/ia64 already do this kind of accounting with
VIRT_CPU_ACCOUNTING, so this patch will give them task- and
cgroup-level info free of charge (other than potential bugs with this
code change :-)).
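As a rough user-space analogue of that overhead measurement (kernel
sched_clock() can't be called from userspace, so this times vDSO
clock_gettime() reads with the TSC instead; treat the numbers as
indicative only):

/* x86-only sketch: estimate cycles per clock read with the TSC. */
#include <stdio.h>
#include <time.h>
#include <x86intrin.h>

#define ITERS 1000000

int main(void)
{
	struct timespec ts;
	unsigned long long start, end;
	int i;

	start = __rdtsc();
	for (i = 0; i < ITERS; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts); /* stand-in clock read */
	end = __rdtsc();

	printf("~%.1f cycles per clock read\n",
	       (double)(end - start) / ITERS);
	return 0;
}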

Thanks,
Venki
