linux-kernel - Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20160930094905.GA22867@dhcp22.suse.cz>
Date:   Fri, 30 Sep 2016 11:49:05 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Rik van Riel <riel@...hat.com>
Cc:     Minchan Kim <minchan@...nel.org>,
        Sonny Rao <sonnyrao@...omium.org>, Jann Horn <jann@...jh.net>,
        Robert Foss <robert.foss@...labora.com>, corbet@....net,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Konstantin Khlebnikov <koct9i@...il.com>,
        Hugh Dickins <hughd@...gle.com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        John Stultz <john.stultz@...aro.org>,
        ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
        Johannes Weiner <hannes@...xchg.org>,
        Kees Cook <keescook@...omium.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Cyrill Gorcunov <gorcunov@...nvz.org>,
        Robin Humble <plaguedbypenguins@...il.com>,
        David Rientjes <rientjes@...gle.com>,
        eric.engestrom@...tec.com, Janis Danisevskis <jdanis@...gle.com>,
        calvinowens@...com, Alexey Dobriyan <adobriyan@...il.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        ldufour@...ux.vnet.ibm.com, linux-doc@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Ben Zhang <benzh@...omium.org>,
        Bryan Freed <bfreed@...omium.org>,
        Filipe Brandenburger <filbranden@...omium.org>,
        Mateusz Guzik <mguzik@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Mike Galbraith <mgalbraith@...e.com>
Subject: Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3]
 Implement /proc/<pid>/totmaps)

[CC Mike and Mel as they have seen some accounting oddities
 when doing performance testing. They can share details but
 essentially the system time just gets too high]

For your reference the email thread started
http://lkml.kernel.org/r/20160823143330.GL23577@dhcp22.suse.cz

I suspect this is mainly for short lived processes - like kernel compile
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.45
        System time (seconds): 0.82
        Percent of CPU this job got: 111%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14
$ rm mm/mmap.o
$ /usr/bin/time -v make mm/mmap.o
[...]
        User time (seconds): 0.47
        System time (seconds): 1.55
        Percent of CPU this job got: 107%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88

This is quite unexpected for a cache hot compile. I would expect most of
the time being spent in userspace.

$ perf report | grep kernel.vmlinux
     2.01%  as        [kernel.vmlinux]         [k] page_fault
     0.59%  cc1       [kernel.vmlinux]         [k] page_fault
     0.15%  git       [kernel.vmlinux]         [k] page_fault
     0.12%  bash      [kernel.vmlinux]         [k] page_fault
     0.11%  sh        [kernel.vmlinux]         [k] page_fault
     0.08%  gcc       [kernel.vmlinux]         [k] page_fault
     0.06%  make      [kernel.vmlinux]         [k] page_fault
     0.04%  rm        [kernel.vmlinux]         [k] page_fault
     0.03%  ld        [kernel.vmlinux]         [k] page_fault
     0.02%  bash      [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  git       [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.01%  cat       [kernel.vmlinux]         [k] page_fault
     0.01%  collect2  [kernel.vmlinux]         [k] page_fault
     0.00%  sh        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  rm        [kernel.vmlinux]         [k] entry_SYSCALL_64
     0.00%  grep      [kernel.vmlinux]         [k] page_fault

doesn't show anything unexpected.

Original Rik's reply follows:

On Tue 23-08-16 17:46:11, Rik van Riel wrote:
> On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote:
[...]
> > OK, so it seems I found it. I was quite lucky because
> > account_user_time
> > is not all that popular function and there were basically no changes
> > besides Riks ff9a9b4c4334 ("sched, time: Switch
> > VIRT_CPU_ACCOUNTING_GEN
> > to jiffy granularity") and that seems to cause the regression.
> > Reverting
> > the commit on top of the current mmotm seems to fix the issue for me.
> > 
> > And just to give Rik more context. While debugging overhead of the
> > /proc/<pid>/smaps I am getting a misleading output from /usr/bin/time
> > -v
> > (source for ./max_mmap is [1])
> > 
> > root@...t1:~# uname -r
> > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334
> > root@...t1:~# ./max_map 
> > pid:2990 maps:65515
> > root@...t1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps
> > rss:263368 pss:262203
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps"
> >         User time (seconds): 0.00
> >         System time (seconds): 0.45
> >         Percent of CPU this job got: 98%
> > 
> 
> > root@...t1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2}
> > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps
> > rss:263316 pss:262199
> >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps"
> >         User time (seconds): 0.18
> >         System time (seconds): 0.29
> >         Percent of CPU this job got: 97%
> 
> The patch in question makes user and system
> time accounting essentially tick-based. If
> jiffies changes while the task is in user
> mode, time gets accounted as user time, if
> jiffies changes while the task is in system
> mode, time gets accounted as system time.
> 
> If you get "unlucky", with a job like the
> above, it is possible all time gets accounted
> to system time.
> 
> This would be true both with the system running
> with a periodic timer tick (before and after my
> patch is applied), and in nohz_idle mode (after
> my patch).
> 
> However, it does seem quite unlikely that you
> get zero user time, since you have 125 timer
> ticks in half a second. Furthermore, you do not
> even have NO_HZ_FULL enabled...
> 
> Does the workload consistently get zero user
> time?
> 
> If so, we need to dig further to see under
> what precise circumstances that happens.
> 
> On my laptop, with kernel 4.6.3-300.fc24.x86_64
> I get this:
> 
> $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf
> "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps
> rss:263368 pss:262145
> 	Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END
> {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps"
> 	User time (seconds): 0.64
> 	System time (seconds): 0.19
> 	Percent of CPU this job got: 99%
> 	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83
> 
> The main difference between your and my
> NO_HZ config seems to be that NO_HZ_FULL
> is set here. However, it is not enabled
> at run time, so both of our systems
> should only really get NO_HZ_IDLE
> effectively.
> 
> Running tasks should get sampled with the
> regular timer tick, while they are running.
> 
> In other words, vtime accounting should be
> disabled in both of our tests, for everything
> except the idle task.
> 
> Do I need to do anything special to reproduce
> your bug, besides running the max mmap program
> and the awk script?
> 
> 

-- 
Michal Hocko
SUSE Labs