linux-kernel - Re: Perf regression after enabling nvme autonomous power state transitions

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <CALCETrWJhR6oiGDTLh-C-qteeCsqicKsVJ12o3qzBH=+PkdWMg@mail.gmail.com>
Date:   Mon, 20 Mar 2017 14:23:27 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>
Cc:     Andy Lutomirski <luto@...nel.org>, Jens Axboe <axboe@...com>,
        Christoph Hellwig <hch@....de>,
        LKML <linux-kernel@...r.kernel.org>,
        Chris Wilson <chris@...is-wilson.co.uk>
Subject: Re: Perf regression after enabling nvme autonomous power state transitions

On Mon, Mar 20, 2017 at 4:14 AM, Tvrtko Ursulin
<tvrtko.ursulin@...ux.intel.com> wrote:
>
> On 17/03/2017 15:26, Andy Lutomirski wrote:
>>
>> On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin
>> <tvrtko.ursulin@...ux.intel.com> wrote:
>>>
>>>
>>> Hi Andy, all,
>>>
>>> I have bisected and verified an interesting performance regression caused
>>> by
>>> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous
>>> power state transitions".
>>>
>>> Having that patch or not accounts for approx. 3% perf difference in a
>>> test
>>> which is, and this is the best part, not even i/o bound by any stretch of
>>> the imagination.
>>
>>
>> Weird.  Is there something in the workload that is periodically
>> reading or otherwise doing a synchronous IO?  APST shouldn't affect
>> throughput in any meaningful way, but it certainly affects latency of
>> sporadic IO.
>
>
> Not as far as I can see. All output files are opened in a most basic way and
> there are no sync sycalls of any sort in the trace. Reads are all coming
> from page cache.
>
> I did look more closely at the I/O stats (I use "vmstat 1" for this), and it
> looks that there is some I/O happening during the runtime. I was surprised
> by that since the total data written out is only ~14MiB as I said. (And
> those are the only writes the test does, apart from stdout/err at beginning
> and end.)
>
> I tried increasing dirty_*_bytes values to some huge ones to try to
> eliminate any potential background or direct writeback during the test run
> but that did not seem to help. I could still see a tiny bit of I/O, few KiB
> here and there occasionally, and then a large 14MiB dump at the end of a
> test run.
>
> Any ideas on how to figure out who is exactly doing these writes and why?
>

You could try playing with iosnoop or blktrace.

>
>>> The test is multi-process with overall medium CPU usage and high GPU
>>> (Intel)
>>> usage. Average runtime is around 13 seconds during which it writes out
>>> around 14MiB of data.
>>>
>>> It does so in chunks during the whole runtime but doesn't do anything
>>> special, just normal O_RDWR | O_CREAT | O_TRUNC so in practice this is
>>> all
>>> written to the device only the end of the test run in one chunk. Via the
>>> background write out I suspect.
>>>
>>> The 3% mentioned earlier translates to approx. 430ms longer average
>>> runtime
>>> with the above patch.
>>
>>
>> That is surprisingly long.  If your device is telling the truth about
>> its latencies, this should be literally impossible.
>>
>>> ps    0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>>>           rwt:0 rwl:0 idle_power:- active_power:-
>>
>>
>> This is the normal on state.
>>
>>> ps    1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>>>           rwt:1 rwl:1 idle_power:- active_power:-
>>> ps    2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>>>           rwt:2 rwl:2 idle_power:- active_power:-
>>
>>
>> These won't be used at all.
>>
>>> ps    3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>>>           rwt:3 rwl:3 idle_power:- active_power:-
>>> ps    4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>>>           rwt:4 rwl:4 idle_power:- active_power:-
>>
>>
>> These two have their latencies specified in microseconds and will be
>> used autonomously.
>>
>> Can you re-test with
>> /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
>> following values and report back?
>>
>> 0 - will turn APST off entirely
>> 1 - will leave APST on but will not program any transitions
>> 11000 - should allow state 3 but not state 4
>
>
> 11000 seems to be enough to bring the results back to expected. 1 and 0 seem
> to not have any further effect.
>
> I've also tried making the test output to /dev/null and that had the same
> beneficial result, even with the default "25000" setting for
> pm_qos_latency_tolerance.

My best guess is that your disk lies and the latency in ps 4 is just
extraordinarily high.

--Andy