linux-kernel - Re: [PATCH v2 5/5] vfio: ccw: add traceponits for interesting error paths

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180430170358.0ee6fe6a.cohuck@redhat.com>
Date:   Mon, 30 Apr 2018 17:03:58 +0200
From:   Cornelia Huck <cohuck@...hat.com>
To:     Halil Pasic <pasic@...ux.ibm.com>
Cc:     Dong Jia Shi <bjsdjshi@...ux.vnet.ibm.com>,
        Halil Pasic <pasic@...ux.vnet.ibm.com>,
        linux-kernel@...r.kernel.org, linux-s390@...r.kernel.org,
        kvm@...r.kernel.org, borntraeger@...ibm.com,
        bjsdjshi@...ux.ibm.com, pmorel@...ux.ibm.com
Subject: Re: [PATCH v2 5/5] vfio: ccw: add traceponits for interesting error
 paths

On Mon, 30 Apr 2018 16:14:21 +0200
Halil Pasic <pasic@...ux.ibm.com> wrote:

> On 04/30/2018 01:51 PM, Cornelia Huck wrote:
> > On Sat, 28 Apr 2018 13:50:23 +0800
> > Dong Jia Shi <bjsdjshi@...ux.vnet.ibm.com> wrote:
> >   
> >> * Cornelia Huck <cohuck@...hat.com> [2018-04-27 12:13:53 +0200]:
> >>  
> >>> On Mon, 23 Apr 2018 13:01:13 +0200
> >>> Dong Jia Shi <bjsdjshi@...ux.vnet.ibm.com> wrote:
> >>>
> >>> typo in subject: s/traceponits/tracepoints/
> >>>      
> >>>> From: Halil Pasic <pasic@...ux.vnet.ibm.com>
> >>>>
> >>>> Add some tracepoints so we can inspect what is not working as is should.
> >>>>
> >>>> Signed-off-by: Halil Pasic <pasic@...ux.vnet.ibm.com>
> >>>> Signed-off-by: Dong Jia Shi <bjsdjshi@...ux.vnet.ibm.com>
> >>>> ---
> >>>>   drivers/s390/cio/Makefile         |  1 +
> >>>>   drivers/s390/cio/vfio_ccw_fsm.c   | 16 +++++++-
> >>>>   drivers/s390/cio/vfio_ccw_trace.h | 77 +++++++++++++++++++++++++++++++++++++++
> >>>>   3 files changed, 93 insertions(+), 1 deletion(-)
> >>>>   create mode 100644 drivers/s390/cio/vfio_ccw_trace.h  
> >>>
> >>>      
> >>>> @@ -135,6 +142,8 @@ static void fsm_io_request(struct vfio_ccw_private *private,
> >>>>   			goto err_out;
> >>>>   
> >>>>   		io_region->ret_code = cp_prefetch(&private->cp);
> >>>> +		trace_vfio_ccw_cp_prefetch(get_schid(private),
> >>>> +					   io_region->ret_code);
> >>>>   		if (io_region->ret_code) {
> >>>>   			cp_free(&private->cp);
> >>>>   			goto err_out;
> >>>> @@ -142,11 +151,13 @@ static void fsm_io_request(struct vfio_ccw_private *private,
> >>>>   
> >>>>   		/* Start channel program and wait for I/O interrupt. */
> >>>>   		io_region->ret_code = fsm_io_helper(private);
> >>>> +		trace_vfio_ccw_fsm_io_helper(get_schid(private),
> >>>> +					     io_region->ret_code);
> >>>>   		if (io_region->ret_code) {
> >>>>   			cp_free(&private->cp);
> >>>>   			goto err_out;
> >>>>   		}
> >>>> -		return;
> >>>> +		goto out;
> >>>>   	} else if (scsw->cmd.fctl & SCSW_FCTL_HALT_FUNC) {
> >>>>   		/* XXX: Handle halt. */
> >>>>   		io_region->ret_code = -EOPNOTSUPP;
> >>>> @@ -159,6 +170,9 @@ static void fsm_io_request(struct vfio_ccw_private *private,
> >>>>   
> >>>>   err_out:
> >>>>   	private->state = VFIO_CCW_STATE_IDLE;
> >>>> +out:
> >>>> +	trace_vfio_ccw_io_fctl(scsw->cmd.fctl, get_schid(private),
> >>>> +			       io_region->ret_code);
> >>>>   }
> >>>>   
> >>>>   /*  
> >>>
> >>> I really don't want to bikeshed, especially as some tracepoints are
> >>> better than no tracepoints, but...
> >>>
> >>> We now trace fctl/schid/ret_code unconditionally (good).
> >>>
> >>> We trace the outcome of cp_prefetch() and fsm_io_helper()
> >>> unconditionally. We don't, however, trace all things that may go wrong.
> >>> We have the tracepoint at the end, but it cannot tell us where the
> >>> error came from. Should we have tracepoints in every place (in this
> >>> function) that may generate an error? Only if there is an actual error?
> >>> Are the two enough for common debug scenarios?  
> >> Trace actual error sounds like a better idea than trace unconditionally
> >> of these two functions.
> >> These two are not enough for common debug scenarios. For example, we
> >> cann't tell if a -EOPNOTSUPP is a orb->tm.b problem, or error code
> >> returned by cp_init().
> >>
> >> Idea to improve:
> >> 1. Trace actual error.
> >> 2. Define a trace event and add error trace for cp_init().  
> > 
> > Hm. Going from what I have done in the past when doing printk debugging:
> > 
> > - stick in a message that is always hit, with some information about
> >    parameters, if it makes sense
> > - stick in a message "foo happened!" in the error branches
> >     - or, alternatively, trace the called functions
> > 
> > So tracing on failure only might be more useful? Have all failure paths
> > under a common knob to turn on/off?
> >   
> >>> Opinions? We can just go ahead with this and improve things later
> >>> on, I guess.
> >>>      
> >> I think it's also fine to do this - better something than nothing. We
> >> could at least have a code base to be improved to make everybody
> >> happier in future.  
> > 
> > Maybe keep the patch as it is now, except trace the errors only
> > (keeping the fctl trace point)?  
> 
> What do you mean by this sentence. Get rid of vfio_ccw_io_fctl or get
> rid of vfio_ccw_cp_prefetch and vfio_ccw_fsm_io_helper, or get don't
> get rid of any, but make some conditional (!errno)?

The third option.

> 
> > 
> > Halil, as you wrote the patch (and I presume you found it helpful):
> > What is your opinion?
> >   
> 
> I'm in favor of this patch (as previously stated here
> https://patchwork.kernel.org/patch/10298305/). And regarding the
> questions under discussion I'm mostly fine either way.

OK.

> 
> I think the naming of this fctl thing is a bit cryptic,
> but if we don't see this as ABI I'm fine with it -- can be improved.
> What would be a better name? I was thinking along the lines accept_request.
> (Bad error code would mean that the request did not get accepted. Good
> code does not mean the requested function was performed successfully.)

I think fctl is fine (if you don't understand what 'fctl' is, you're
unlikely to understand it even if it were named differently.)

> 
> Also I think vfio_ccw_io_fctl with no zero error code would make sense
> as dev_warn. If I were an admin looking into a problem I would very much
> appreciate seeing something in the messages log (and not having to enable
> tracing first). This point seems to be a good one for high level 'request gone
> bad' kind of report. Opinions?

I'd also exclude -EOPNOTSUPP (as this also might happen with e.g. a halt/clear enabled user space, which probes availability of halt/clear support by giving it a try once (yes, I really want to post my patches this week.))