Message-ID: <85b22e16-fffc-7ebe-2ab1-3b6fe7e036ab@linux.intel.com>
Date:   Fri, 15 Feb 2019 12:35:51 +0800
From:   "Li, Aubrey" <aubrey.li@...ux.intel.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     mingo@...hat.com, peterz@...radead.org, hpa@...or.com,
        ak@...ux.intel.com, tim.c.chen@...ux.intel.com,
        dave.hansen@...el.com, arjan@...ux.intel.com, aubrey.li@...el.com,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v11 2/3] x86,/proc/pid/status: Add AVX-512 usage elapsed
 time

On 2019/2/14 19:29, Thomas Gleixner wrote:
> On Wed, 13 Feb 2019, Aubrey Li wrote:
> 
>> AVX-512 components use could cause core turbo frequency drop. So
>> it's useful to expose AVX-512 usage elapsed time as a heuristic hint
>> for the user space job scheduler to cluster the AVX-512 using tasks
>> together.
>>
>> Example:
>> $ cat /proc/pid/status | grep AVX512_elapsed_ms
>> AVX512_elapsed_ms:      1020
>>
>> The number '1020' denotes 1020 millisecond elapsed since last time
>> context switch the off-CPU task using AVX-512 components, thus the
> 
> I know what you are trying to say, but this sentence does not parse. So
> what you want to say is:
> 
>   This means that 1020 milliseconds have elapsed since the AVX512 usage of
>   the task was detected when the task was scheduled out.

Thanks, will refine this.
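To make the semantics above concrete, here is a toy model (not kernel code, just an illustration of the behavior described): assume the kernel stamps a timestamp at schedule-out whenever the task's AVX-512 state was found in use, and /proc/<pid>/status reports "now - timestamp" in milliseconds. The class and its names are made up for this sketch.

```python
import time

class TaskModel:
    """Toy model of per-task AVX-512 elapsed-time accounting."""

    def __init__(self):
        self.avx512_timestamp = None  # time of last AVX-512 detection, if any

    def schedule_out(self, avx512_in_use):
        # Kernel side: at context switch, record the time if the task's
        # AVX-512 state was in use when it was scheduled out.
        if avx512_in_use:
            self.avx512_timestamp = time.monotonic()

    def elapsed_ms(self):
        # /proc/<pid>/status side: milliseconds elapsed since the AVX-512
        # usage of the task was last detected at schedule-out.
        if self.avx512_timestamp is None:
            return None  # usage never detected
        return int((time.monotonic() - self.avx512_timestamp) * 1000)

t = TaskModel()
t.schedule_out(avx512_in_use=True)
time.sleep(0.05)
print(t.elapsed_ms())  # roughly 50 ms
```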

> 
> Aside of that 1020ms is hardly evidence for real AVX512 usage, so you want
> to come up with a better example than that.

Oh, I wrote a simple benchmark that loops {AVX ops for a while, then non-AVX ops
for a while}, so this is expected. Yeah, I should use real AVX-512 usage. Below is
the output while TensorFlow trains a neural network model to classify images
(HZ = 250 on my side). Will change to use this example.

$ while [ 1 ]; do cat /proc/83226/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	12
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	16
AVX512_elapsed_ms:	4
AVX512_elapsed_ms:	0
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	8
AVX512_elapsed_ms:	4

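For reference, a hypothetical sketch of how a userspace job scheduler could consume this hint: parse AVX512_elapsed_ms out of the /proc/<pid>/status text and apply a recency cutoff. The field name matches the patch; the 100 ms threshold is an illustrative assumption, not part of the patch.

```python
def avx512_elapsed_ms(status_text):
    """Return the AVX512_elapsed_ms value from /proc/<pid>/status text,
    or None if the field is absent."""
    for line in status_text.splitlines():
        if line.startswith("AVX512_elapsed_ms:"):
            return int(line.split(":", 1)[1].strip())
    return None

def recently_used_avx512(status_text, threshold_ms=100):
    """Heuristic: treat the task as an AVX-512 user if usage was detected
    within threshold_ms (a made-up cutoff for illustration)."""
    elapsed = avx512_elapsed_ms(status_text)
    return elapsed is not None and elapsed <= threshold_ms

sample = "Name:\tavx_demo\nAVX512_elapsed_ms:\t16\n"
print(avx512_elapsed_ms(sample))     # 16
print(recently_used_avx512(sample))  # True
```

A scheduler would read the real file per pid and cluster the tasks that test True onto the same cores, per the changelog's stated use case.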
> 
> But that makes me think about the usefulness of this hint in general.
> 
> A AVX512 using task which runs alone on a CPU, is going to have either no
> AVX512 usage recorded at all or the time elapsed since the last recording
> is absurdly long.

I did an experiment on this, please correct me if I'm wrong.

I isolated CPU 103 and ran an AVX-512 micro benchmark (spinning AVX-512 ops) on it.

$ cat /proc/cmdline 
root=UUID=e6503b72-57d7-433a-ab09-a4b9a39e9128 ro isolcpus=103

I still saw context switches:
aubrey@...rey-skl:~$ sudo trace-cmd report --cpu 103
cpus=104
        avx_demo-6985  [103]  5055.442432: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5055.442434: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5059.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5059.442432: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5063.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5063.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]
        avx_demo-6985  [103]  5067.442430: sched_switch:         avx_demo:6985 [120] R ==> migration/103:527 [0]
   migration/103-527   [103]  5067.442431: sched_switch:         migration/103:527 [0] S ==> avx_demo:6985 [120]

It looks like some kernel threads still participate in context switches on the
isolated CPU, as above; each CPU has one migration daemon to do migration jobs.

Under this scenario the elapsed time indeed becomes longer than normal, see below:

$ while [ 1 ]; do cat /proc/6985/status | grep AVX; sleep 1; done
AVX512_elapsed_ms:	3432
AVX512_elapsed_ms:	440
AVX512_elapsed_ms:	1444
AVX512_elapsed_ms:	2448
AVX512_elapsed_ms:	3456
AVX512_elapsed_ms:	460
AVX512_elapsed_ms:	1464
AVX512_elapsed_ms:	2468

But AFAIK Google's Heracles polls every 15 s, so this worst case should still be acceptable, no?
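As a sanity check on that claim, here is a hedged sketch of such a periodic polling pass in the Heracles style: each pass reads a candidate pid's status text and buckets it. The 5000 ms cutoff is an assumption chosen to cover the ~3.5 s worst-case elapsed seen on the isolated CPU above; the status reader is injected so the sketch does not depend on a live /proc.

```python
def poll_once(pids, read_status, cutoff_ms=5000):
    """One polling pass: return the subset of pids whose AVX-512 usage was
    detected within cutoff_ms. read_status(pid) -> /proc/<pid>/status text."""
    avx_users = []
    for pid in pids:
        for line in read_status(pid).splitlines():
            if line.startswith("AVX512_elapsed_ms:"):
                if int(line.split(":", 1)[1]) <= cutoff_ms:
                    avx_users.append(pid)
                break  # field found; stop scanning this pid's status
    return avx_users

# Example with canned data standing in for /proc/<pid>/status:
fake = {6985: "AVX512_elapsed_ms:\t3456\n", 1234: "AVX512_elapsed_ms:\t60000\n"}
print(poll_once(sorted(fake), lambda pid: fake[pid]))  # [6985]
```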

> IOW, this needs crystal ball magic to decode because
> there is no correlation between that elapsed time and the time when the
> last context switch happened simply because that time is not available in
> /proc/$PID/status. Sure you can oracle it out from /proc/$PID/stat with
> even more crystal ball magic, but there is no explanation at all.
> 
> There may be use case scenarios where this crystal ball prediction is
> actually useful, but the inaccuracy of that information and the possible
> pitfalls for any user space application which uses it need to be documented
> in detail. Without that, this is going to cause more trouble and confusion
> than benefit.
> 
I'm not sure whether the above experiment addresses your concern; please correct
me if I have misunderstood.

Thanks,
-Aubrey
