linux-kernel - Re: [PATCH v1 8/8] block: add add_disk() failure injection support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e938c21f-3872-232c-4956-dfa53aec579b@suse.de>
Date:   Wed, 12 May 2021 17:22:48 +0200
From:   Hannes Reinecke <hare@...e.de>
To:     Luis Chamberlain <mcgrof@...nel.org>, axboe@...nel.dk
Cc:     bvanassche@....org, ming.lei@...hat.com, hch@...radead.org,
        jack@...e.cz, osandov@...com, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1 8/8] block: add add_disk() failure injection support

On 5/12/21 8:46 AM, Luis Chamberlain wrote:
> For a long time we have lived without any error handling
> on the add_disk() error path. Now that we have some initial
> error handling, add error injection support for its path so
> that we can test it and ensure we don't regress this path
> moving forward.
> 
> This only adds runtime code *iff* the new bool CONFIG_FAIL_ADD_DISK is
> enabled in your kernel. If you don't have this enabled this provides
> no new functional. When CONFIG_FAIL_ADD_DISK is disabled the new routine
> blk_should_fail_add_disk() ends up being transformed to if (false), and
> so the compiler should optimize these out as dead code producing no
> new effective binary changes.
> 
> Failure injection lets us configure at boot how often we want a failure
> to take place by specifying the interval, the probability, and when needed
> a size constraint. We don't need to test for size constraints for
> add_disk() and so ignore that part of error injection. Although testing
> early boot failures with add_disk() failures might be useful we don't
> to make add_disk() fail every time as otherwise we wouldn't be able to
> boot. So enabling add_disk() error injection requires a second post
> boot step where you specify where in the add_disk() code path you want
> to enable failure injection for. This lets us verify correctness of
> the different error handling parts of add_disk(), while also allowing
> a respective blktests test to grow dynamically in case the add_disk()
> paths grows.
> 
> We currently enable 11 code paths on add_disk() which can fail
> and we can test for:
> 
> 	# ls -1 /sys/kernel/debug/block/config_fail_add_disk/
> 	alloc_devt
> 	alloc_events
> 	bdi_register
> 	device_add
> 	disk_add_events
> 	get_queue
> 	integrity_add
> 	register_disk
> 	register_queue
> 	sysfs_bdi_link
> 	sysfs_depr_link
> 
> If you want to modify the configuration of fail_add_disk dynamically
> at boot, you can enable CONFIG_FAULT_INJECTION_DEBUG_FS. If you've
> enabled CONFIG_FAIL_ADD_DISK you will see these knobs:
> 
> 	# ls -1 /sys/kernel/debug/block/fail_add_disk/
> 	interval
> 	probability
> 	space
> 	task-filter
> 	times
> 	verbose
> 	verbose_ratelimit_burst
> 	verbose_ratelimit_interval_ms
> 
> Suggested-by: Bart Van Assche <bvanassche@....org>
> Signed-off-by: Luis Chamberlain <mcgrof@...nel.org>
> ---
>   .../fault-injection/fault-injection.rst       | 23 ++++++++
>   block/Makefile                                |  1 +
>   block/blk-core.c                              |  1 +
>   block/blk.h                                   | 55 ++++++++++++++++++
>   block/failure-injection.c                     | 54 ++++++++++++++++++
>   block/genhd.c                                 | 57 +++++++++++++++++++
>   lib/Kconfig.debug                             | 13 +++++
>   7 files changed, 204 insertions(+)
>   create mode 100644 block/failure-injection.c
> 
[ .. ]
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d1467658361f..4fccc0fad190 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1917,6 +1917,19 @@ config FAULT_INJECTION_USERCOPY
>   	  Provides fault-injection capability to inject failures
>   	  in usercopy functions (copy_from_user(), get_user(), ...).
>   
> +config FAIL_ADD_DISK
> +	bool "Fault-injection capability for add_disk() callers"
> +	depends on FAULT_INJECTION && BLOCK
> +	help
> +	  Provide fault-injection capability for the add_disk() block layer
> +	  call path. This allows the kernel to provide error injection when
> +	  the add_disk() call is made. You would use something like blktests
> +	  test against this or just load the null_blk driver. This only
> +	  enables the error injection functionality. To use it you must
> +	  configure which path you want to trigger on error on using debugfs
> +	  under /sys/kernel/debug/block/config_fail_add_disk/. By default
> +	  all of these are disabled.
> +
>   config FAIL_MAKE_REQUEST
>   	bool "Fault-injection capability for disk IO"
>   	depends on FAULT_INJECTION && BLOCK
> 

Hmm. Not a fan of this approach.

Having to have a separate piece of code just to test individual 
functions, _and_ having to place hooks in the code to _simulate_ a 
failure seems rather fragile to me.

I would have vastly preferred if we could to this via generic tools like 
ebpf or livepatching.
Also I'm worried that this approach doesn't really scale; taken to 
extremes we would have to add duplicate calls to each and every function 
for full error injection, essentially double the size of the code just 
on the off-chance that someone wants to do error injection.

So I'd rather delegate the topic of error injection to a more general 
discussion (LSF springs to mind ...), and then agree on a framework 
which is suitable for every function.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@...e.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer