[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <abda51ea-e123-0fdc-c1e6-86f0e33d5591@linux.intel.com>
Date: Mon, 20 Mar 2017 11:14:11 +0000
From: Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Andy Lutomirski <luto@...nel.org>, Jens Axboe <axboe@...com>,
Christoph Hellwig <hch@....de>,
LKML <linux-kernel@...r.kernel.org>,
Chris Wilson <chris@...is-wilson.co.uk>
Subject: Re: Perf regression after enabling nvme autonomous power state
transitions
On 17/03/2017 15:26, Andy Lutomirski wrote:
> On Fri, Mar 17, 2017 at 3:58 AM, Tvrtko Ursulin
> <tvrtko.ursulin@...ux.intel.com> wrote:
>>
>> Hi Andy, all,
>>
>> I have bisected and verified an interesting performance regression caused by
>> commit c5552fde102fcc3f2cf9e502b8ac90e3500d8fdf "nvme: Enable autonomous
>> power state transitions".
>>
>> Having that patch or not accounts for approx. 3% perf difference in a test
>> which is, and this is the best part, not even i/o bound by any stretch of
>> the imagination.
>
> Weird. Is there something in the workload that is periodically
> reading or otherwise doing a synchronous IO? APST shouldn't affect
> throughput in any meaningful way, but it certainly affects latency of
> sporadic IO.
Not as far as I can see. All output files are opened in a most basic way
and there are no sync sycalls of any sort in the trace. Reads are all
coming from page cache.
I did look more closely at the I/O stats (I use "vmstat 1" for this),
and it looks that there is some I/O happening during the runtime. I was
surprised by that since the total data written out is only ~14MiB as I
said. (And those are the only writes the test does, apart from
stdout/err at beginning and end.)
I tried increasing dirty_*_bytes values to some huge ones to try to
eliminate any potential background or direct writeback during the test
run but that did not seem to help. I could still see a tiny bit of I/O,
few KiB here and there occasionally, and then a large 14MiB dump at the
end of a test run.
Any ideas on how to figure out who is exactly doing these writes and why?
>> The test is multi-process with overall medium CPU usage and high GPU (Intel)
>> usage. Average runtime is around 13 seconds during which it writes out
>> around 14MiB of data.
>>
>> It does so in chunks during the whole runtime but doesn't do anything
>> special, just normal O_RDWR | O_CREAT | O_TRUNC so in practice this is all
>> written to the device only the end of the test run in one chunk. Via the
>> background write out I suspect.
>>
>> The 3% mentioned earlier translates to approx. 430ms longer average runtime
>> with the above patch.
>
> That is surprisingly long. If your device is telling the truth about
> its latencies, this should be literally impossible.
>
>> ps 0 : mp:9.00W operational enlat:5 exlat:5 rrt:0 rrl:0
>> rwt:0 rwl:0 idle_power:- active_power:-
>
> This is the normal on state.
>
>> ps 1 : mp:4.60W operational enlat:30 exlat:30 rrt:1 rrl:1
>> rwt:1 rwl:1 idle_power:- active_power:-
>> ps 2 : mp:3.80W operational enlat:30 exlat:30 rrt:2 rrl:2
>> rwt:2 rwl:2 idle_power:- active_power:-
>
> These won't be used at all.
>
>> ps 3 : mp:0.0700W non-operational enlat:10000 exlat:300 rrt:3 rrl:3
>> rwt:3 rwl:3 idle_power:- active_power:-
>> ps 4 : mp:0.0050W non-operational enlat:2000 exlat:10000 rrt:4 rrl:4
>> rwt:4 rwl:4 idle_power:- active_power:-
>
> These two have their latencies specified in microseconds and will be
> used autonomously.
>
> Can you re-test with
> /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us set to the
> following values and report back?
>
> 0 - will turn APST off entirely
> 1 - will leave APST on but will not program any transitions
> 11000 - should allow state 3 but not state 4
11000 seems to be enough to bring the results back to expected. 1 and 0
seem to not have any further effect.
I've also tried making the test output to /dev/null and that had the
same beneficial result, even with the default "25000" setting for
pm_qos_latency_tolerance.
So I suspected unexpected direct write back was happening. But as I
wrote above increasing the dirty_bytes (or dirty_background_bytes, or
the expire controls) did not help. I used 200Mb and 20s for all the
fields, which I assumed would have caused everything to go quiet during
the test run, but it hadn't.
Regards,
Tvrtko
Powered by blists - more mailing lists