linux-kernel - Re: [PATCH 00/36] cputime: Convert core use of cputime

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20161121162003.GB7554@lerouge>
Date:   Mon, 21 Nov 2016 17:20:06 +0100
From:   Frederic Weisbecker <fweisbec@...il.com>
To:     Martin Schwidefsky <schwidefsky@...ibm.com>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Tony Luck <tony.luck@...el.com>,
        Wanpeng Li <wanpeng.li@...mail.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Mackerras <paulus@...ba.org>,
        Ingo Molnar <mingo@...nel.org>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Rik van Riel <riel@...hat.com>,
        Stanislaw Gruszka <sgruszka@...hat.com>
Subject: Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs

On Mon, Nov 21, 2016 at 07:59:56AM +0100, Martin Schwidefsky wrote:
> On Fri, 18 Nov 2016 15:47:02 +0100
> Frederic Weisbecker <fweisbec@...il.com> wrote:
> > Just because some code isn't too complex doesn't mean we really want to keep it.
> > I get regular questions about what unit does cputime_t map to on a given
> > configuration. Everybody gets confused about that. On many of the
> > patches we got on cputime for the last years, I had to fix quite some issues
> > with bad granularity assumption. In fact most fixes that came to kernel/sched/cputime.c
> > recently, after merge or review, were about people getting confused with cputime_t granularity.
> 
> These regular question you get about the cputime_t is exactly what I was referring
> to. If the value would just be a u64 the guys asking the question about cputime_t
> would just assume the value to be nano-seconds and then go ahead and break things.

Sure, replacing cputime_t with u64 without changing the unit wouldn't help. But changing
it to nsecs and expect people to deduce it from the u64 type sounds a good direction.

> 
> > Especially for stats that come from nsecs clocks (steal and irqtime), we always have to maintain an
> > accumulator and make sure we don't lose some nanosec deltas.
> 
> Yes, for the CONFIG_IRQ_TIME_ACCOUNTING=y case.

Right.

> 
> > And we have to maintain several workarounds, sometimes even in the fastpath in
> > order to cope with the cputime_t random granularity all over.
> > 
> > Some fastpath examples:
> > 
> > * steal time accounting (need to convert nsecs to cputime then back)
> > * irqtime accounting (maintain accumulators)
> > * cputime_adjust, used on any user read of cputime (need to convert from nsecs
> >   to cputime on cputime_adjust)
> > 
> > But the worst really is about maintainance. This patchset removes around 600 lines.
> 
> Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
> the generic cputime_jiffies.h and cputime_nsecs.h.

Well, still worth it :-)

> > > The do_account_vtime function is called once per jiffy and once per task
> > > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > > would not be so bad, but the call on the scheduling path *will* hurt.
> > 
> > I don't think we need to flush on task switch. If we maintain the accumulators
> > on the task/thread struct instead of per-cpu, then the remaining time after
> > task switch out will be accounted on next tick after after next task switch in.
> 
> You can not properly calculate steal time if you allow sleeping tasks to sit on
> up to 5*HZ worth of cpu time.

Ah, you mean that when the task goes to sleep, we shouldn't miss more than one
tick worth of system/user time but the steal time can be much higher, right?

> I think we *have* to do accounting on task switch.
> At least on s390, likely on powerpc as well. Why not make that an option for
> the architecture with the yet-to-be-written accumulating code.

Ok, how about doing the accumulation and always account on task switch for now,
we'll see later if it's worth having such an option.

> 
> > > What is even worse is the vtime_account_irq_enter path, that is call several
> > > times for each *interrupt*, at least two times for an interrupt without
> > > additional processing and four times if a softirq is triggered.
> > 
> > Actually maintaining an accumulator to flush on ticks is probably going to increase
> > the perf because of that. account_system_time() is called twice per interrupt, and
> > such function do much more than just account the time to the task_struct and cpustat
> > fields. The same applies to userspace boundaries and context switch. The account_*_time()
> > functions can be expensive.
> 
> The account_system_time twice per interrupt can be removed with the accumulation
> idea. We will have to see how expensive the accounting_xxx_time calls are on
> the context switch path.

Right.

> 
> > > 
> > > Now it has been proposed to implement lazy accounting to accumulate deltas
> > > and do the expensive conversions only infrequently. This is pretty straight-
> > > forward for account_user_time but to do this for the account_system_time
> > > function is more complicated. The function has to differentiate between
> > > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > > each bucket and provide a separate function to add to each bucket. Like
> > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > account_system_time(). Then it is up to the arch code to sort out the details
> > > and call the accounting code once per jiffy for each of the buckets.
> > 
> > That wouldn't be too hard really. The s390 code in vtime.c already does that.
> 
> Yes, I agree that the accumulating change would not be too hard. Can I make the
> request that we try to get that done first before doing the cleanup ?

Of course. I see you started something, I'll be glad to help!

>  
> > > We still have to do the whole thing on each task switch though.
> > 
> > Not if we maintain the deltas in the task_struct.
> > 
> > > 
> > > But I am still not happy about the approach. What is the compelling reason
> > > for this change except for the "but it looks ugly"?
> > 
> > The diffstat (600 lines removed). Also the fact that we have all these workarounds
> > in the core code just for the special case of 1 arch (s390) and a half
> > (powerpc with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE).
> > 
> > I'd much rather have all that complexity moved in a vtime_native.c shared by s390 and powerpc
> > that takes care of proper accumulation in cputime_t and flushes that on ticks in nsecs rather
> > than having all these cputime_t game all over the kernel.
> 
> The goal to have nano-seconds only in the core code is a good one. And with the
> accumulator I think s390 can live with it. The change would have a real upside
> too. There are these stupid divisions for scaled cputime that we have to calculate
> for every call to account_xxx_time(). These would not be done for the interrupts
> anymore.

Exactly!


Thanks.