[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160930094905.GA22867@dhcp22.suse.cz>
Date: Fri, 30 Sep 2016 11:49:05 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Rik van Riel <riel@...hat.com>
Cc: Minchan Kim <minchan@...nel.org>,
Sonny Rao <sonnyrao@...omium.org>, Jann Horn <jann@...jh.net>,
Robert Foss <robert.foss@...labora.com>, corbet@....net,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>,
Konstantin Khlebnikov <koct9i@...il.com>,
Hugh Dickins <hughd@...gle.com>,
Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
John Stultz <john.stultz@...aro.org>,
ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
Johannes Weiner <hannes@...xchg.org>,
Kees Cook <keescook@...omium.org>,
Al Viro <viro@...iv.linux.org.uk>,
Cyrill Gorcunov <gorcunov@...nvz.org>,
Robin Humble <plaguedbypenguins@...il.com>,
David Rientjes <rientjes@...gle.com>,
eric.engestrom@...tec.com, Janis Danisevskis <jdanis@...gle.com>,
calvinowens@...com, Alexey Dobriyan <adobriyan@...il.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
ldufour@...ux.vnet.ibm.com, linux-doc@...r.kernel.org,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Ben Zhang <benzh@...omium.org>,
Bryan Freed <bfreed@...omium.org>,
Filipe Brandenburger <filbranden@...omium.org>,
Mateusz Guzik <mguzik@...hat.com>,
Mel Gorman <mgorman@...e.de>,
Mike Galbraith <mgalbraith@...e.com>
Subject: Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3]
Implement /proc/<pid>/totmaps)
[CC Mike and Mel as they have seen some accounting oddities
when doing performance testing. They can share details but
essentially the system time just gets too high]
For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.GL23577@dhcp22.suse.cz
I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.45
System time (seconds): 0.82
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
User time (seconds): 0.47
System time (seconds): 1.55
Percent of CPU this job got: 107%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88
This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.
$ perf report | grep kernel.vmlinux
2.01% as [kernel.vmlinux] [k] page_fault
0.59% cc1 [kernel.vmlinux] [k] page_fault
0.15% git [kernel.vmlinux] [k] page_fault
0.12% bash [kernel.vmlinux] [k] page_fault
0.11% sh [kernel.vmlinux] [k] page_fault
0.08% gcc [kernel.vmlinux] [k] page_fault
0.06% make [kernel.vmlinux] [k] page_fault
0.04% rm [kernel.vmlinux] [k] page_fault
0.03% ld [kernel.vmlinux] [k] page_fault
0.02% bash [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% git [kernel.vmlinux] [k] entry_SYSCALL_64
0.01% cat [kernel.vmlinux] [k] page_fault
0.01% collect2 [kernel.vmlinux] [k] page_fault
0.00% sh [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% rm [kernel.vmlinux] [k] entry_SYSCALL_64
0.00% grep [kernel.vmlinux] [k] page_fault
doesn't show anything unexpected.
Original Rik's reply follows:
On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> >
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> >
> > root@...t1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@...t1:~# ./max_map
> > pid:2990 maps:65515
> > root@...t1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> > User time (seconds): 0.00
> > System time (seconds): 0.45
> > Percent of CPU this job got: 98%
> >
>
> > root@...t1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> > User time (seconds): 0.18
> > System time (seconds): 0.29
> > Percent of CPU this job got: 97%
>
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
>
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
>
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
>
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
>
> Does the workload consistently get zero user
> time?
>
> If so, we need to dig further to see under
> what precise circumstances that happens.
>
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
>
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> User time (seconds): 0.64
> System time (seconds): 0.19
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
>
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
>
> Running tasks should get sampled with the
> regular timer tick, while they are running.
>
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
>
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?
>
>
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists