linux-kernel - Re: [PATCH v3 0/8] vfio: ccw: Refactoring the VFIO CCW state machine

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180626180052.7412ee79.cohuck@redhat.com>
Date:   Tue, 26 Jun 2018 18:00:52 +0200
From:   Cornelia Huck <cohuck@...hat.com>
To:     Pierre Morel <pmorel@...ux.ibm.com>
Cc:     pasic@...ux.vnet.ibm.com, bjsdjshi@...ux.vnet.ibm.com,
        linux-s390@...r.kernel.org, linux-kernel@...r.kernel.org,
        kvm@...r.kernel.org
Subject: Re: [PATCH v3 0/8] vfio: ccw: Refactoring the VFIO CCW state
 machine

On Tue, 26 Jun 2018 13:04:12 +0200
Pierre Morel <pmorel@...ux.ibm.com> wrote:

> On 19/06/2018 16:00, Cornelia Huck wrote:
> > On Thu, 14 Jun 2018 10:06:31 +0200
> > Pierre Morel<pmorel@...ux.ibm.com>  wrote:
> >  
> >> I tried to make a better description to add later in documentation
> >> or in the next cover-letter.
> >>
> >> Note that in the current patch series I did not implement online/offline
> >> events but just kept the previous state changes.
> >> Not sure if it was a good idea, the goal was to be as simple as possible
> >> for this iteration.
> >> If you think it is worth to continue in this direction I will re-introduce
> >> these as events.  
> > I'm still trying to figure out what we want from the state machine.
> > I've tried sketching your fsm proposal as outlined by you below for
> > myself and I have some remarks :)  
> 
> Hi,
> 
> Sorry for the delay in my answer, I was away from my keyboard
> almost the all week :) .

np :)

> 
> >  
> >> ======================
> >>
> >> 1) In CCW VFIO, Finite State Machine is used to clarify the VFIO CCW driver
> >> actions to be take depending on the events occurring in a device.
> >>
> >> To protect the state transitions, a mutex protect the action
> >> started when an event occurs in a specific state.
> >>
> >> The actions must never sleep or keep the mutex a long timer.
> >>
> >> To be able to take the mutex when hardware events occur we start
> >> a work on a dedicated workqueue, posting the event from inside the
> >> workqueue.
> >>
> >> The state machine has very few states describing the device driver life.
> >>
> >> ------------------------------------------------------------
> >> | NOT_OPERATIONAL  | No guest is associated with the device|  
> > I don't think that this is the right description. It is either the
> > initial state, or something has happened that rendered the device
> > inaccessible.  
> 
> The goal of the state machine is to describe the device driver state.
> Not the device state which is hold by the CIO level.

I don't think there really is such a thing as "device driver state" (or
maybe I don't understand what you mean by it). The state is attached to
the individual device, isn't it?

> 
> I think this has lead to some confusion since I tried to keep the old
> naming convention.
> So, I agree that a state named "NON_INIT" or "UNCONFIGUED" would be better
> to distinguish it from the device state "NOT_OPERATIONAL"
> 
> > Also, we need to be careful with the virtual machine vs. guest
> > terminology. The only thing that has an impact from the guest is when
> > it does I/O and when an interrupt is generated as a result (i.e., the
> > IDLE/BUSY transitions). The other transitions are triggered by virtual
> > machine setup.  
> 
> Absolutely.
> 
> >> ------------------------------------------------------------
> >> | STANDBY          | A guest is associated but not started |  
> > This is basically "device is bound to driver, but no mdev has been set
> > up".  
> 
> When the device is bound to the driver, the device driver
> is still in NOT_OPERATIONAL state.
> 
> The transition to STANDBY is done when the virtual machine starts and
> opens the mediated device.
> Note that the device is still not usable until it is reseted.

Hm, isn't that transition happening when the mdev is created?

> 
> >   In my understanding, we need to get the device to IDLE state
> > before the vm can present it to the guest (be it before the machine is
> > up or during hotplug).  
> 
> The virtual machine needs to RESET the device sending the VFIO_RESET
> to the device driver to make the device driver go to the IDLE state.
> 
> >> ------------------------------------------------------------
> >> | IDLE             | Guest started device ready            |  
> > Agree with "device ready", disagree with "guest started" (see above).
> > "Device ready, accepting I/O requests"?  
> 
> Indeed bad description, VM started, and almost as you said
> "Device driver ready, accepting I/O requests for device"
> (Device is ready too)
> 
> >> ------------------------------------------------------------
> >> | BUSY             | The device is busy doing IO           |
> >> ------------------------------------------------------------
> >> | QUIESCING        | The driver is releasing the device    |  
> > Now you are talking about the driver :) This should probably be done
> > for the other states as well.  
> 
> Ill update the description to make clear that the state is the
> driver state (device driver) and not the device state.
> The device state is handled by the firmware.

Now you have managed to confuse me completely... isn't the firmware
only handling the (real) subchannel state?

> 
> > Isn't that state for waiting for outstanding I/O to be terminated
> > before the mdev is destroyed? IOW, the device may stay bound to the
> > driver afterwards?  
> 
> Yes it is.
> 
> >> ------------------------------------------------------------
> >>
> >>
> >> 2) The following Events may apply to the state machine:
> >>
> >> If the event is not described, it means it has no influence
> >> on the state and that no action is required.
> >>
> >> 2.1) FSM in state NOT_OPERATIONAL
> >> ------------------------------------------------------------
> >> | INIT             | a guest will drive the SC             |  
> > Better to write out "subchannel", I could not figure out the
> > abbreviation immediately.  
> 
> OK, right, Ill do.
> 
> > Also, doesn't the event really mean "we're initializing the device"?  
> 
> No, just the device driver.
> 
> > The ultimate intention is of course for a guest to use the device, but
> > the immediate intention is just "get the device through the first
> > initialization steps".
> >  
> >> | |                                       |
> >> |                  | triggered by: mdev_open               |
> >> |                  | action: initialize driver structures  |
> >> |                  | action: register IOMMU notifier       |
> >> |                  | state on success: STANDBY             |
> >> |                  | state on error  : NOT_OPERATIONAL     |
> >> ------------------------------------------------------------
> >>
> >> 2.2) FSM in state STANDBY
> >>
> >> ------------------------------------------------------------
> >> | ONLINE           | The host wants the SC online          |  
> > Isn't that rather "an mdev is created"?  
> 
> No, ONLINE is triggered by VFIO_RESET, after the mdev has been created
> and when the VM is started or restarted by QEMU.
> 
> I see that I did not do a good job describing what triggers the events.
> I will try again in a dedicated document.

It might be good to combine the two.

> 
> >  
> >> | |                                       |
> >> |                  | triggered by: vfio_reset              |
> >> |                  | action: enable SC                     |
> >> |                  | state on success: IDLE                |
> >> |                  | state on error  : STANDBY             |  
> > What happens if the subchannel is not operational when we try to get it
> > ready? Can it go to NOT_OPERATIONAL in that case?  
> 
> I think we have a confusion between the sub-channel being non operational
> and the device driver being non operational.
> Here, the device driver is operational, even the device may not.

How can something be operational if the device is not? It could be in a
state like the ccw device's disconnected state, but it's certainly not
ready for requests.

> 
> For the VM perspective, if the device is not operational we send a RESET.
> 
> For the guest perspective we can do what ever instruction we needs to
> get the device operational, therefor we need the driver to be operational
> to process the instruction.

Ah, do you mean 'subchannel enabled'? I was thinking about 'device
dead, we get cc 3 or somesuch'.

> 
> 
> 
> >  
> >> ------------------------------------------------------------
> >>
> >> Some operations may be intercepted by the state machine but
> >> will not induce a state change:
> >> OFFLINE: re-issue the disabling of the SC  
> > Should that even be possible? If we're still busy tearing it down,
> > shouldn't we be in QUIESCING state?  
> 
> So yes, I think you are right, after a OFFLINE, triggered by
> the VFIO_release the state is QUESCING
> 
> >  
> >> IRQ    : re-issue the disabling of the SC  
> > Either we should have cleared the subchannel out during QUIESCING, or
> > we got an unsolicited interrupt. The latter can be avoided if we make
> > sure that the device is disabled when we go to STANDBY.  
> 
> I will take a new look at cause of the unsolicited interrupts and
> awaited actions.
> 
> >  
> >> 2.3) FSM in state IDLE
> >>
> >> ------------------------------------------------------------
> >> | SSCH             | The guest issue the ssch instruction  |
> >> | |                                       |
> >> |                  | triggered by: vfio_write SSCH=1       |
> >> |                  | action: start an IO request           |
> >> |                  | state on success: BUSY                |
> >> |                  | state on error  : IDLE                |  
> > Should there be any way to drop to NOT_OPERATIONAL as well? We'll
> > probably get a notification from the common I/O layer if that happens,
> > though.  
> 
> I think this is the duty of the guest to handle this kind of thing.
> The device driver must stay operational.

See above. I think we'll get a notification from the common I/O layer,
and it probably makes sense to inject the same notification (CRW) into
the guest.

> 
> >  
> >> ------------------------------------------------------------
> >> 2.4) FSM in states IDLE or BUSY
> >>
> >> ------------------------------------------------------------
> >> | IRQ              | The hardware issue an interrupt       |  
> > I think we can get this in QUIESCING as well, and it is an indication
> > that QUIESCING is done (for final states).  
> 
> Yes. I did not describe the QUIESCING state in this email.
> because I did not add the patch on QUIESCING.
> With our discussion things becomes clearer and I will
> add it back after corrections.

Ok.

> 
> >
> > If the subchannel is enabled while in STANDBY, we could get an
> > interrupt there as well.  
> 
> No, the subchannel is not enabled while in STANDBY.
> The cio_enable() is issued during the transition from STANDBY
> to IDLE and the event should only happen when the transition
> completed, in IDLE.
> 
> How ever spurious interrupt may happen which should be handled
> locally in the STANDBY state.

Yes, that's what I meant.

> 
> >  
> >> | |                                       |
> >> |                  | triggered by: vfio_ccw_sch_irq        |
> >> |                  | action: update SCSW forward IRQ       |
> >> |                  | state on success: IDLE                |  
> > One thing to consider (and I'm not sure we're handling it correctly
> > right now): What if we don't have a final status for the I/O yet, i.e.
> > there will be another interrupt for the I/O? Should we stay in BUSY in
> > that case?  
> 
> I will take a new look at the interrupt processing.
> If we can be sure another interrupt will come, we may
> wait for it in BUSY.
> In case of doubt we better not and handle the interrupt in IDLE
> otherwise we hang in BUSY.

Agreed. But we should be able to find out whether we got a final state.

> 
> >  
> >> |                  | state on error  : IDLE                |  
> > Hm, what error?  
> 
> There are no error there, therefor no state change.
> A better description is "state on error: NA"

Ok.

> 
> >  
> >> ------------------------------------------------------------
> >>
> >> ------------------------------------------------------------
> >> | OFFLINE          | The host wants the SC offline         |  
> > That's mdev teardown, I guess?
> >  
> >> | |                                       |
> >> |                  | triggered by: css remove              |
> >> |                  | triggered by: css shutdown            |  
> > These as well.
> >  
> >> |                  | action: disable SC                    |
> >> |                  | state on success: STANDBY             |  
> > If it's really a css remove (unbind from driver or device removal), it
> > should not really have a target state, as the vfio-ccw device will be
> > gone, no? This is only correct for mdev removal.  
> 
> May be more complex.
> 
> The event is not right, we should have two different:
> 
> - On css remove and shutdown
> There we have a serious problem, a sub-channel disappeared.
> CIO level handle the quiescing of the sub-channel.
> The final state is NOT_OPER and we must somehow say the guest
> that the hardware configuration changed (machine check?).
> Don't we ?

I think there's a comment/TODO in the code for that.

> 
> - On vfio_release
> This is another thing, the guest goes away, so we need to
> smoothly shut down the sub channel using the QUIESCING state.
> 
> Once again this an indication to rework the OFFLINE event.
> 
> >  
> >> |                  | state on error  : NOT_OPERATIONAL     |
> >> ------------------------------------------------------------
> >>
> >> ------------------------------------------------------------
> >> | STORE_SCHIB      | The hardware issue "schib updated"    |  
> > This is the common I/O layer's sch_event callback. That can mean
> > different things:
> > - Something regarding the paths to the device changed. We can translate
> >    that to "schib updated".
> > - Device is gone. This is a different situation; we may either try to
> >    implement the disconnected state, or we may give up the device.
> >
> > Also, can't we get this event in QUIESCING or STANDBY as well?  
> 
> I think it should be ignored in STANDBY as we did not even
> start to use the sub-channel we have no information on it
> at that moment.

As long as we do update it some time before we need the information, ok.

> 
> In QUIESCING we may have to handle it for internal purpose
> to make sure to shutdown smoothly.
> I will work on this again.
> (Still this OFFLINE refactoring)
> 
> >  
> >> | |                                       |
> >> |                  | triggered by: sch_event               |
> >> |                  | action: Store the SCHIB in IO region  |
> >> |                  | state on success: no change           |  
> > As said, we may want some different handling for "device gone"
> > situations.
> >
> > Also, what happens if we get that event in BUSY? Is the I/O still
> > running?  
> 
> AFAIK yes, it is an asynchronous event.
> It can happen during BUSY.
> However AFAIU we still get an IRQ even we get the device gone
> during I/O operation.

It depends. We either get a deferred cc 3, or a machine check (but no
interrupt).

> 
> >  
> >> |                  | state on error  : NOT_OPERATIONAL     |
> >> ------------------------------------------------------------
> >>
> >> NOTE 1:
> >>      All out of band IO related instructions, XSCH, CSCH, HSCH
> >>      can be started in both states IDLE and BUSY.  
> > Hm. What do you mean by "out of band"?  
> May be a bad description, I made a difference between SSCH/IRQ
> two event handling I/O operations with a little protocoling:
> We send SSCH during IDLE we move to BUSY we expect IRQ
> and when IRQ we move to IDLE.
> 
> And the other instructions which can go aside of the
> protocoling direct to the hardware whatever the state is.
> 
> >
> > You can, of course, get all of the instructions above in both IDLE and
> > BUSY states. The question is what we want to do with them. Do we want
> > to put them through to the hardware in any case? If we do e.g. a hsch
> > and get a cc 2, we don't want to drop to IDLE, but stay in BUSY (as the
> > start function is likely still running).  
> 
> In the case the instruction is accepted (CC=0):
> 
> CANCEL: XSCH
>      in a first draw should not go to the device driver.
>      later we may check if we got a cancel before starting the SSCH 
> instruction
>      inside of the SSCH action.

Ok. Although we probably don't need to dwell on this too much, the
"always cc 2 on xsch" approach should already cover guest usage, I
guess.

> 
> CLEAR: CSCH
>      goes directly to hardware, no state change.
>      BUSY state: we will get an interrupt with clear pending which drive 
> us to IDLE
> 
> HALT: HSCH
>      goes directly to hardware, no state change.
>      BUSY state: we will get an interrupt with halt pending which drive 
> us to IDLE

I'm not sure about that; csch/hsch are different. csch clears any other
pending state; hsch may return busy, pending on what is currently going
on.

> 
> In all cases the CCW chain is kept until we get the confirmation that the
> program is not used any more by the sub-channel.

For csch, that would be immediately after cc 0.

> 
> In the case the instruction is not accepted (CC!=0)
> we must return the answer to the guest.

Yes, that's only for hsch, though.

> 
> >
> > What about RSCH?  
> 
> The CCW program must be kept "alive" in the kernel waiting to be resumed.
> 
> If we find the SUSPEND bit inside the CCW program during the processing
> of a SSCH, we should return a new "SUSPENDED" state to the state machine.
> 
> When we receive the RSCH from the guest
>      - We verify that guest cleared the SUSPEND bit of the current CCW
>        and fetch the new chain starting at this CCW
>      - we forward it to the hardware and go to BUSY state
> 
> 
> Another thing which will induce a new state is the CCW PCI bit which will
> induce intermediate status.
> (by the way we also should handle the ORB I bit)

This is something that needs handling, agreed.

> 
> >  
> >> NOTE 2:
> >>      Currently IRQ means solicited interrupt, but handle both
> >>      solicited and unsolicited interrupts.
> >>      The unsolicited interrupt event may be implemented separatly.  
> > We need to be ready for unsolicited interrupts at any point in time. It
> > mainly depends on the state whether we want to forward them to the
> > guest (BUSY, IDLE) or not (QUIESCING, STANDBY).  
> 
> 100% agree
> 
> >  
> >> 2.5) FSM in any state
> >>
> >> ------------------------------------------------------------
> >> | NOT_OPERATIONAL  | The device is released                |
> >> | |                                       |
> >> |                  | triggered by: mdev_release            |
> >> |                  | action: unregister IOMMU notifier     |
> >> |                  | state on success: NOT_OPERATIONAL     |
> >> ------------------------------------------------------------  
> > Confused. Isn't that "device gone or not operational" as well?
> >  
> 
> It is confusing, as said before, I need to revisit the OFFLINE event.
> 
> instead of using the cover letter, I will enhance the vfio-ccw documentation
> to cover the state machine.

Sounds like a good idea.

> 
> Thanks a lot for the reviewing.

I hope I did not add to the confusion :)