Message-ID: <4e5e476b0910040215m35af5c99pf2c3a463a5cb61dd@mail.gmail.com>
Date:	Sun, 4 Oct 2009 11:15:24 +0200
From:	Corrado Zoccolo <czoccolo@...il.com>
To:	Vivek Goyal <vgoyal@...hat.com>
Cc:	Valdis.Kletnieks@...edu, Mike Galbraith <efault@....de>,
	Jens Axboe <jens.axboe@...cle.com>,
	Ingo Molnar <mingo@...e.hu>,
	Ulrich Lukas <stellplatz-nr.13a@...enparkplatz.de>,
	linux-kernel@...r.kernel.org,
	containers@...ts.linux-foundation.org, dm-devel@...hat.com,
	nauman@...gle.com, dpshah@...gle.com, lizf@...fujitsu.com,
	mikew@...gle.com, fchecconi@...il.com, paolo.valente@...more.it,
	ryov@...inux.co.jp, fernando@....ntt.co.jp, jmoyer@...hat.com,
	dhaval@...ux.vnet.ibm.com, balbir@...ux.vnet.ibm.com,
	righi.andrea@...il.com, m-ikeda@...jp.nec.com, agk@...hat.com,
	akpm@...ux-foundation.org, peterz@...radead.org,
	jmarchan@...hat.com, torvalds@...ux-foundation.org, riel@...hat.com
Subject: Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler 
	based IO controller V10)

Hi Vivek,
On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@...hat.com> wrote:
> On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
>> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
>> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> >> In fact I think that the 'rotating' flag name is misleading.
>> >> All the checks we are doing are actually checking whether the device
>> >> truly supports multiple parallel operations, and this feature is shared
>> >> by hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or single
>> >> NCQ-enabled SATA disks.
>> >>
>> >
>> > While we are at it, what happens to notion of priority of tasks on SSDs?
>> This is not changed by the proposed patch w.r.t. the current CFQ.
>
> This is a general question irrespective of the current patch. I want to know
> what our statement is w.r.t. ioprio and what it means for the user: when do
> we support it and when do we not?
>
>> > Without idling there is no continuous time slice and there is no
>> > fairness. So is ioprio out of the window for SSDs?
>> I don't have NCQ-enabled SSDs here, so I can't test it, but it seems to
>> me that the way in which queues are sorted in the rr tree may still
>> provide some sort of fairness and service differentiation for
>> priorities, in terms of number of IOs.
>
> I have an NCQ-enabled SSD. Sometimes I see the difference, sometimes I do
> not. I guess this happens because sometimes idling is enabled and sometimes
> not, because of the dynamic nature of hw_tag.
>
My guess is that the formula used to handle this case is not very
stable.
The culprit code is (in cfq_service_tree_add):
        } else if (!add_front) {
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                rb_key += cfqq->slice_resid;
                cfqq->slice_resid = 0;
        } else

cfq_slice_offset is defined as:

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
{
        /*
         * just an approximation, should be ok.
         */
        return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
                       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}
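
For reference, here is a small userspace sketch (not the kernel code itself)
that tabulates what this offset evaluates to, assuming the default 100 ms
sync slice and CFQ_SLICE_SCALE of 5: with three busy queues a prio 7 sync
queue is pushed about 280 ms later in the key space than a prio 0 one, and
because everything scales with (busy_queues - 1), the offsets jump around as
queues come and go:

#include <stdio.h>

#define CFQ_SLICE_SCALE 5

/* mirrors cfq_prio_slice(): base + base/SCALE * (4 - prio) */
static int prio_slice(int base_slice, int prio)
{
        return base_slice + (base_slice / CFQ_SLICE_SCALE) * (4 - prio);
}

int main(void)
{
        const int base = 100;        /* assumed default sync slice, in ms */
        const int busy_queues = 3;   /* e.g. the three fio jobs below */
        int prio;

        for (prio = 0; prio <= 7; prio++) {
                int offset = (busy_queues - 1) *
                             (prio_slice(base, 0) - prio_slice(base, prio));
                printf("prio %d: slice %3d ms, rb_key offset +%3d ms\n",
                       prio, prio_slice(base, prio), offset);
        }
        return 0;
}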

Can you try changing the latter to something simpler (we already observed
that busy_queues is unstable, and I think that it is not needed here at
all):
        return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
and remove the 'rb_key += cfqq->slice_resid;' line from the former.

This should give larger-slice (higher-priority) tasks a higher probability
of being served first, so it will work if we don't idle, but it needs
some adjustment if we do idle.
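
For clarity, this is a minimal sketch of how the two spots could look with
the suggested change applied (untested, everything else unchanged):

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
{
        /*
         * A larger slice (higher priority) gives a more negative offset,
         * i.e. an earlier position in the service tree.
         */
        return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
}

and, in cfq_service_tree_add:

        } else if (!add_front) {
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                /* slice_resid is no longer folded into rb_key */
                cfqq->slice_resid = 0;
        } else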

> I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> third prio7.
>
> (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
>
> Note there is almost no difference between the prio 0 and prio 4 jobs, and
> the prio 7 job has been penalized heavily (it gets less than 10% of the BW
> of the prio 4 job).
>
>> Non-NCQ SSDs, on the other hand, will still have the idle window enabled,
>> so this is not an issue for them.
>
> Agree.
>
>> >
>> > On SSDs, will it make more sense to provide fairness in terms of number of
>> > IOs or size of IO, and not in terms of time slices?
>> Not on all SSDs. There are still ones that have a non-negligible
>> penalty on non-sequential access patterns (hopefully the ones without
>> NCQ, but if we find otherwise, then we will have to benchmark access
>> time in the I/O scheduler to select the best policy). For those,
>> time-based fairness may still be needed.
>
> Ok.
>
> So on the better SSDs out there with NCQ, we probably don't support the notion
> of ioprio? Or am I missing something?

I think we try, but the current formula is simply not good enough.

Thanks,
Corrado

>
> Thanks
> Vivek
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@...il.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
