linux-kernel - Re: [PATCH 1/2] virtio-scsi: first version

Message-ID: <4EE0B712.7000006@redhat.com>
Date:	Thu, 08 Dec 2011 14:09:38 +0100
From:	Paolo Bonzini <pbonzini@...hat.com>
To:	James Bottomley <James.Bottomley@...senPartnership.com>
CC:	linux-kernel@...r.kernel.org,
	"Michael S. Tsirkin" <mst@...hat.com>,
	linux-scsi <linux-scsi@...r.kernel.org>,
	Rusty Russell <rusty@...tcorp.com.au>,
	Stefan Hajnoczi <stefanha@...ux.vnet.ibm.com>
Subject: Re: [PATCH 1/2] virtio-scsi: first version

On 12/07/2011 03:35 PM, James Bottomley wrote:
> On Wed, 2011-12-07 at 10:41 +0100, Paolo Bonzini wrote:
>> On 12/06/2011 07:09 PM, James Bottomley wrote:
>>> On Mon, 2011-12-05 at 18:29 +0100, Paolo Bonzini wrote:
>>>> The virtio-scsi HBA is the basis of an alternative storage stack
>>>> for QEMU-based virtual machines (including KVM).
>>>
>>> Could you clarify what the problem with virtio-blk is?
>>
>> In a nutshell, if virtio-blk had no problems, then you could also throw
>> away iSCSI and extend NBD instead. :)
>
> Um, I wouldn't make that as an argument.  For a linux only transport,
> nbd is far better than iSCSI mainly because it's a lot simpler and
> easier and doesn't have a tied encapsulation ... it is chosen in a lot
> of implementations for that reason.

Indeed virtio-blk is not going to disappear overnight.

>> The main problem is that *every* new feature requires updating three or
>> more places: the spec, the host (QEMU), and the guest drivers (at least
>> two: Linux and Windows).  Exposing the new feature also requires
>> updating all the hosts, but also all the guests.
>
> Define "new feature"; you mean the various request types for flush and
> discard?

So far the feature bits that had to be added were barrier (now 
deprecated), maximum request size, maximum segments/request, geometry 
information (chs, for BIOS boot), read-only, total size, SCSI requests, 
flush requests, WCE, topology (aka block limits).  WCE and topology 
actually are in the code but not in the virtio spec.  For each of these, 
both the host and the guest drivers had to be updated.
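
As a rough illustration of why each of these touches both sides, here is a minimal sketch of feature-bit negotiation.  The bit names and positions are hypothetical placeholders, not the values from the virtio spec or any header; the only point is that a feature becomes usable only when both host and guest know about it.

```python
# Hypothetical feature bits: names and positions are illustrative only,
# they are NOT taken from the virtio spec or from any driver header.
FLUSH = 1 << 0
WCE = 1 << 1
TOPOLOGY = 1 << 2


def negotiate(host_features: int, guest_features: int) -> int:
    """Return the features both sides understand (bitwise intersection)."""
    return host_features & guest_features


# The host has been updated to offer WCE, but the guest driver predates it.
new_host = FLUSH | WCE | TOPOLOGY
old_guest = FLUSH | TOPOLOGY

negotiated = negotiate(new_host, old_guest)
wce_usable = bool(negotiated & WCE)  # False: the guest must be updated too
```

Under this model, updating the host alone never exposes a new capability; the guest driver (and the spec that assigns the bit) must follow.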

These still do not cover discard (and secure discard), bidirectional 
SG_IO, and perhaps something for removable media. (*)  Any future 
extension of course will also require updating the host and guest 
drivers (plus the spec).

     (*) I mention removable media because one of two usecases I know
         for SG_IO on virtio-blk is burning CDs.

At some point, it makes sense to rethink the protocol.  virtio-scsi is 
substantially saner in this respect; it requires 1/3 of the work to 
implement a new feature, and especially frees us from having to define 
another spec specially for virtualization.  This is why I listed 
extensibility as part of the goals for virtio-scsi.

>> With virtio-scsi, the host device provides nothing but a SCSI transport.
>> You still have to update everything (spec+host+guest) when something
>> is added to the SCSI transport, but that's a pretty rare event.
>
> Well, no it's not, the transports are the fastest evolving piece of the
> SCSI spec.

No, I mean when something is added to the generic definition of SCSI 
transport (SAM, more or less), not the individual transports.  When the 
virtio-scsi transport has to change, you still have to update 
spec+host+guest, but that's relatively rare.

>> In the most common case, there is a feature that the guest already
>> knows about, but that QEMU does not implement (for example a
>> particular mode page bit). Once the host is updated to expose the
>> feature, the guest picks it up automatically.
>
> That's in the encapsulation, surely; these are used to set up the queue,
> so only the queue runner (i.e. the host) needs to know.

Not at all.  You can start the guest in writethrough-cache mode.  Then, 
guests that know how to do flush+FUA can enable writeback for 
performance.  There's nothing virtio-blk or virtio-scsi specific in 
this.  But in virtio-scsi you only need to update the host.  In 
virtio-blk you need to update the guest and spec too.
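
To make the writeback example concrete: in SCSI the write cache is controlled by the WCE bit of the caching mode page (page code 0x08, byte 2, bit 2), which a guest reads with MODE SENSE and changes with MODE SELECT.  The sketch below only manipulates the page bytes; the helper names are made up for illustration, and it is not taken from QEMU or from sd.c.

```python
CACHING_PAGE_CODE = 0x08
WCE_BIT = 0x04  # byte 2, bit 2 of the caching mode page


def wce_enabled(page: bytes) -> bool:
    """Report whether the write cache is enabled in a caching mode page."""
    if page[0] & 0x3F != CACHING_PAGE_CODE:
        raise ValueError("not a caching mode page")
    return bool(page[2] & WCE_BIT)


def with_wce(page: bytes, enable: bool) -> bytes:
    """Return a copy of the page with WCE set or cleared (hypothetical helper)."""
    new_page = bytearray(page)
    if enable:
        new_page[2] |= WCE_BIT
    else:
        new_page[2] &= ~WCE_BIT & 0xFF
    return bytes(new_page)


# A 20-byte caching page: page code 0x08, page length 0x12, all flags clear.
writethrough = bytes([CACHING_PAGE_CODE, 0x12]) + bytes(18)
writeback = with_wce(writethrough, True)
```

A guest that already knows how to issue MODE SELECT needs no driver change for this; only the host has to start honoring the bit.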

> I don't get this.  If you have a file backed SCSI device, you have to
> interpret the MODE_SELECT command on the transport.  How is that any
> different from unwrapping the SG_IO picking out the MODE_SELECT and
> interpreting it?

The difference is that virtio-scsi exposes a direct-access SCSI device, 
nothing less nothing more.  virtio-blk exposes a disk that has nothing 
to do with SCSI except that it happens to understand SG_IO; the primary 
means for communication are the virtio-blk config space and read/write 
requests.

So, for virtio-blk, SG_IO is good for persistent reservations, burning 
CDs, and basically nothing else.  Neither of these can really be done in 
the host by interpreting, so for virtio-blk it makes sense to simply 
pass through.

For virtio-scsi, the SCSI command set is how you communicate with the 
host, and you don't care about who ends up interpreting the commands: it 
can be local or remote, userspace or kernelspace, a server or a disk, 
you don't care.

So, QEMU is already (optionally) doing interpretation for virtio-scsi. 
It does not do so for virtio-blk, and it is not going to.

>> Regarding passthrough, non-block devices and task management functions
>> cannot be passed via virtio-blk.  Lack of TMFs makes virtio-blk's error
>> handling less than optimal in the guest.
>
> This would be presumably because most of the errors (i.e. the transport
> ones) are handled in the host.  All the guest has to do is pass on the
> error codes the host gives it.
>
> You worry me enormously talking about TMFs because they're transport
> specific.

True, but virtio-blk for example cannot even retry a command at all.

>> It doesn't really matter if it is exclusive or not (it can be
>> non-exclusive with NPIV or iSCSI in the host; otherwise it pretty much
>> has to be exclusive, because persistent reservations do not work).  The
>> important point is that it's at the LUN level rather than the host level.
>
> virtio-blk can pass through at the LUN level surely: every LUN (in fact
> every separate SCSI device) has a separate queue.

virtio-blk isn't meant to do pass through.  virtio-blk had SG_IO bolted 
on it, but this doesn't mean that the guest /dev/vdX is equivalent to 
the host's /dev/sdY.  From kernelspace, features are lacking: no WCE 
toggle, no thin provisioning, no extended copy, etc.  From userspace, 
your block size might be screwed up or worse.  With virtio-scsi, by 
definition the guest /dev/sdX can be as capable as the host's /dev/sdY 
if you ask the host to do passthrough.

>> There are other possible uses, where the target is on the host.  QEMU
>> itself can act as the target, or you can use LIO with FILEIO or IBLOCK
>> backends.
>
> If you use an iSCSI back end, why not an iSCSI initiator.  They may be
> messy but at least the interaction is defined and expected rather than
> encapsulated like you'd be doing with virtio-scsi.

If you use an iSCSI initiator, you need to expose to the guest the 
details of your storage, including possibly the authentication.

I'm not sure however if you interpreted LIO as LIO's iSCSI backend.  In 
that case, note that a virtio-scsi backend for LIO is in the works too.

> so I agree, supporting REQ_DISCARD are host updates because they're an
> expansion of the block protocol.  However, they're rare, and, as you
> said, you have to update the emulated targets anyway.

New features are rare, but there are also features where virtio-blk is 
lagging behind, and those aren't necessarily rare.

Regarding updates to the targets, you have much more control on the host 
than the guest.  Updating the host is trivial compared to updating the 
guest.

> Incidentally, REQ_DISCARD was added in 2008.  In that time close to
> 50 new commands have been added to SCSI, so the block protocol is
> pretty slow moving.

That also means that virtio-blk cannot give guests access to the full 
range of features that they might want to use.  Not all OSes are Linux, 
not all OSes limit themselves to the features of the Linux block protocol.

>> Not to mention that virtio-blk does I/O in units of 512 bytes.  It
>> supports passing an arbitrary logical block size in the configuration
>> space, but even then there's no guarantee that SG_IO will use the same
>> size.  To use SG_IO, you have to fetch the logical block size with READ
>> CAPACITY.
>
> So here what I think you're telling me is that virtio-blk doesn't have a
> correct discovery protocol?

No, I'm saying that virtio-blk's SG_IO is not meant to be used for 
configuration, I/O or discovery.  If you want to use it for those tasks, 
and it breaks, you're on your own.  virtio-blk lets you show a 
4k-logical-block disk as having 512b logical blocks, for example because 
otherwise you could not boot from it; however, as soon as you use SG_IO 
the truth shows.  The answer is "don't do it", but it can be a severe 
limitation.
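
A small sketch of the mismatch described above, assuming a disk that is advertised with 512-byte blocks in the virtio-blk config space but answers READ CAPACITY(10) with its real 4096-byte block length.  The function names are hypothetical; the 8-byte response layout (big-endian last LBA, then block length) is the standard one.

```python
import struct


def parse_read_capacity_10(data: bytes) -> tuple:
    """Decode a READ CAPACITY(10) response into (block count, block length)."""
    last_lba, block_len = struct.unpack(">II", data[:8])
    return last_lba + 1, block_len


def sector512_to_lba(sector: int, block_len: int) -> int:
    """Convert a 512-byte sector number into an LBA of the real device."""
    byte_offset = sector * 512
    if byte_offset % block_len:
        raise ValueError("sector is not aligned to the device block size")
    return byte_offset // block_len


# Example response from a 4 GiB disk with 4096-byte logical blocks.
response = struct.pack(">II", 0x000FFFFF, 4096)
num_blocks, block_len = parse_read_capacity_10(response)
lba = sector512_to_lba(16, block_len)
```

Anything that issues SG_IO has to work in `block_len` units from the response, not in the 512-byte units the block layer sees, and unaligned requests simply cannot be expressed.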

>>> I'm not familiar necessarily with the problems of QEMU devices, but
>>> surely it can unwrap the SG_IO transport generically rather than
>>> having to emulate on a per feature basis?
>>
>> QEMU does interpret virtio-blk's SG_IO just by passing down the ioctl.
>> With the virtio-scsi backend you can choose between doing so or
>> emulating everything.
>
> So why is that choice not available to virtio-blk?  Surely it could
> interpret after unwrapping the SG_IO encapsulation.

Because if you do this, you get really no advantages.  Userspace uses 
virtio-blk's SG_IO for only a couple of usecases, which hardly apply to 
files.  On the other hand, if you use SPC/SBC as a unified protocol for 
configuration, discovery and I/O, it makes sense to emulate.

> Reading back all of this, I think there's some basic misunderstanding
> somewhere, so let me see if I can make the discussion more abstract.

Probably. :)

> The way we run a storage device today (be it scsi or something else) is
> via a block queue.  The only interaction a user gets is via that queue.
> Therefore, in Linux, slicing the interaction at the queue and
> transporting all the queue commands to some back end produces exactly
> what we have today ...

Let's draw it like this:

              guest       |                    host
                          |
  read() -> req() ---virtio-blk ---> read() -> req -> READ(16) -> device

> now correctly implemented, virtio-blk should do that (and if there
> are problems in the current implementation, I'd rather see them
> fixed), so it should have full equivalency to what a native linux
> userspace sees.

Right: there are missing features I mentioned above, and SG_IO is very 
limited with virtio-blk compared to native, but usually it is fine.  For 
other OSes it is less than ideal, but it can work.  It can be improved 
(not completely fixed), but again at some point, it makes sense to 
rethink the stack.

> Because of the slicing at the top, most of the actual processing,
> including error handling and interpretation goes on in the back end
> (i.e. the host) and anything request based like dm-mp and md (but
> obviously not lvm, which is bio based) ... what I seem to see implied
> but not stated in the above is that you have some reason you want to
> move this into the guest, which is what happens if you slice at a lower
> level (like SCSI)?

Yes, that's what happens if you do passthrough:

              guest                 |            host
                                    |
  read() -> req() -> READ(16) --virtio-scsi ---> ioctl() -> ...

Advantages here include the ability to work with non-block devices, and 
the ability to reuse all the discovery code that is or will be in sd.c. 
If you do it like this and you want multipathing (for example) you indeed 
have to move it into the VM, but it doesn't usually make much sense.

However, something else actually can happen in the host, and here lie 
the interesting cases.  For example, the host userspace can send the 
commands to the LUN via iSCSI, directly:

              guest                 | host with userspace iSCSI initiator
                                    |
  read() -> req() -> READ(16) --virtio-scsi ---> send() -> ...

This is still effectively passthrough; on the other hand, it doesn't 
require you to handle low-level details in the VM.  And unlike an iSCSI 
initiator in the guest, you are free to change how the storage is 
implemented.

A third implementation is to emulate SCSI commands by unpacking them in 
host userspace:

              guest                 |            host
                                    |
  read() -> req() -> READ(16) --virtio-scsi ---> read() -> ...

Again, you reuse all the discovery code that is in sd.c, and future 
improvements can be confined to the emulation code only.  In addition, 
future improvements done to sd.c for non-virt will apply to virt as well 
(either right away or modulo emulation improvements).  You are also 
100% sure that when the guest uses SG_IO it will not exhibit any 
quirks.  And it is also more flexible when your guests are not Linux.
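
To illustrate what "unpacking" means in the emulated case, here is a minimal sketch that decodes a READ(16) CDB and services it with an ordinary read on a backing file.  The function names and the in-memory backing store are assumptions made for the example; this is not QEMU's emulation code, and it ignores flags, residuals and error reporting.

```python
import io
import struct

READ_16 = 0x88  # READ(16) operation code


def decode_read16(cdb: bytes) -> tuple:
    """Return (lba, number of blocks) from a 16-byte READ(16) CDB."""
    if len(cdb) != 16:
        raise ValueError("READ(16) CDB must be 16 bytes")
    # opcode, flags, 8-byte LBA, 4-byte transfer length, group, control
    opcode, _flags, lba, nblocks, _group, _control = struct.unpack(">BBQIBB", cdb)
    if opcode != READ_16:
        raise ValueError("not a READ(16) command")
    return lba, nblocks


def emulate_read16(backing, cdb: bytes, block_size: int = 512) -> bytes:
    """Serve the command with seek() + read() on a file-like backing store."""
    lba, nblocks = decode_read16(cdb)
    backing.seek(lba * block_size)
    return backing.read(nblocks * block_size)


# Three 512-byte blocks; only the middle one is non-zero.
disk = io.BytesIO(bytes(512) + b"\xab" * 512 + bytes(512))
cdb = struct.pack(">BBQIBB", READ_16, 0, 1, 1, 0, 0)  # read 1 block at LBA 1
data = emulate_read16(disk, cdb)
```

Supporting a new command in this scheme means adding another decoder on the host side; the guest keeps sending the same SCSI commands it would send to real hardware.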

There's nothing new in it.  As far as I know, only Xen has a dedicated 
protocol for paravirtualized block devices (in addition to virtio). 
Hyper-V and VMware both use paravirtualized SCSI.

> One of the problems you might also pick up slicing within SCSI is that
> if (by some miracle, admittedly) we finally disentangle ATA from SCSI,
> you'll lose ATA and SATA support in virtio-scsi.  Today you also lose
> support for non-SCSI block devices like mmc

You do not lose that.  Just like virtio-blk cannot do SG_IO to mmc, 
virtio-scsi is only usable with mmc in emulated mode.

Paolo