Message-ID: <f697c699-e2d7-49d2-3cf6-235cdb60811b@brocade.com>
Date:   Wed, 2 Nov 2016 05:16:03 -0400
From:   "Charles (Chas) Williams" <ciwillia@...cade.com>
To:     Sebastian Andrzej Siewior <bigeasy@...utronix.de>
CC:     <linux-kernel@...r.kernel.org>, <rt@...utronix.de>
Subject: Re: [PREEMPT-RT] Oops in rapl_cpu_prepare()

On 10/28/2016 04:03 AM, Sebastian Andrzej Siewior wrote:
> On 2016-10-27 15:00:32 [-0400], Charles (Chas) Williams wrote:
>>> I assume "init_rapl_pmus: maxpkg 4" is from init_rapl_pmus() returning
>>> topology_max_packages(). So it says 4 but then returns 65535 for CPU 2
>>> and 3. That -1 comes probably from topology_update_package_map(). Could
>>> you please send a complete boot log and try the following patch? This
>>> one should fix your boot problem and disable RAPL if the info is
>>> invalid.
>>
>> But sometimes the topology info is correct and if I get lucky, the
>> package id could be valid for all the CPU's.  Given the behavior,
>> I have seen so far it makes me thing the RAPL isn't being emulated.
>> So even if I did boot onto a "valid" set of cores, would I always be
>> certain that I will be on those cores?
>
> I don't what vmware does here. Nor do they ship source to check. So if
> you have a big HW box with say two packages, it might make sense to give
> this information to the guest _if_ the CPUs are pinned and the guest
> never migrates.

Yes, I agree _if_.  That's why it simply isn't clear to me that we should
attempt to do any RAPL at all for VMware.  The current behavior doesn't seem
to make sense, and I don't expect it to suddenly start acting reasonably.
Since I don't understand why some package ids are valid and others
are not, I would prefer not to trust any of the topology information when
deciding whether to enable or disable RAPL monitoring.
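
One way to act on that preference would be to skip RAPL setup whenever the
CPU announces that it is running under a hypervisor.  The following is a
minimal userspace sketch of such a check, not the kernel's code and not the
patch discussed in this thread.  It assumes GCC or Clang on x86 (for
<cpuid.h>) and tests CPUID leaf 1, ECX bit 31, the "hypervisor present" bit.
The function names are hypothetical.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.1:ECX bit 31 is set by hypervisors to announce themselves. */
static int hypervisor_bit_set(unsigned int ecx)
{
	return (ecx >> 31) & 1u;
}

/*
 * Returns 1 if a hypervisor announced itself, 0 if it did not,
 * and -1 if the check cannot be made (non-x86 build or no leaf 1).
 */
static int running_under_hypervisor(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return hypervisor_bit_set(ecx);
#else
	return -1;
#endif
}
```

A caller could refuse to start RAPL monitoring when running_under_hypervisor()
returns 1.  That is a blunt policy: it would also disable RAPL on hypervisors
that do report a consistent topology.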

>
>> Per your request in your next email:
>>
>>> One thing I forgot to ask: Could you please check if you get the same
>>> pkgid reported for cpu 0-3 on a pre-v4.8 kernel? (before the hotplug
>>> rework).
>>
>> Our previous kernel was 4.4, and didn't use the logical package id:
> I see.
>
> Did the patch I sent fixed it for you and were you not able to test?

Yes, it does prevent RAPL from starting and loading.  From the boot log:

[    2.711481] RAPL PMU: rapl pmu error: max package: 4 but CPU2 belongs to 65535
[    2.711639] rapl pmu error: max package: 4 but CPU2 belongs to 65535

This was consistent across several reboots.  I poked around in the
VM settings.  Apparently this guest is configured for four virtual
sockets with one core per socket.  Testing with two virtual sockets,
one core per socket:

[    2.163177] RAPL PMU: rapl pmu error: max package: 2 but CPU1 belongs to 65535
[    2.163304] rapl pmu error: max package: 2 but CPU1 belongs to 65535

Booting with 1 virtual socket, 1 core per socket:

[    1.750311] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[    1.750312] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[    1.750313] RAPL PMU: hw unit of domain package 2^-0 Joules
[    1.750314] RAPL PMU: hw unit of domain dram 2^-0 Joules

Booting with 1 virtual socket, 4 cores per socket:

[    3.527298] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 10737418240 ms ovfl timer
[    3.527302] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[    3.527304] RAPL PMU: hw unit of domain package 2^-0 Joules
[    3.527307] RAPL PMU: hw unit of domain dram 2^-0 Joules

So, it looks like VMware reports an invalid package id for at least one
CPU whenever the guest has more than one virtual socket.  The above
behavior was consistent across several reboots.
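
The error lines above are consistent with a range check of each CPU's
package id against the reported maximum, with a -1 ("no package") ending up
in a 16-bit field and reading back as 65535.  The following is a minimal,
self-contained sketch of that idea, not the actual patch: the 16-bit field
width is an assumption inferred from the 65535 in the log, the function
names are hypothetical, and the ids used for the valid CPUs in the usage
note are invented for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Store a package id in a 16-bit field; -1 wraps around to 65535. */
static uint16_t store_pkg_id(int id)
{
	return (uint16_t)id;
}

/*
 * Return the first CPU whose package id is not below max_pkg,
 * or -1 if every CPU maps to a valid package.
 */
static int first_bad_cpu(const uint16_t *pkg_of_cpu, int ncpus,
			 unsigned int max_pkg)
{
	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (pkg_of_cpu[cpu] >= max_pkg) {
			fprintf(stderr,
				"max package: %u but CPU%d belongs to %u\n",
				max_pkg, cpu, (unsigned int)pkg_of_cpu[cpu]);
			return cpu;
		}
	}
	return -1;
}
```

With four CPUs whose stored ids are 0, 1, -1, -1 and max_pkg = 4, the check
stops at CPU2, as in the four-socket boot log.  With all four CPUs in
package 0 and max_pkg = 1, it reports no bad CPU, as in the single-socket
case.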


