lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 07 May 2020 00:20:51 +0100
From:   Edwin Török <edwin@...rok.net>
To:     "Darrick J. Wong" <darrick.wong@...cle.com>
Cc:     linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: PROBLEM: XFS in-memory corruption with reflinks and duperemove:
 XFS (dm-4): Internal error xfs_trans_cancel at line 1048 of file
 fs/xfs/xfs_trans.c.  Caller xfs_reflink_remap_extent+0x100/0x560

On Wed, 2020-05-06 at 15:47 -0700, Darrick J. Wong wrote:
> On Wed, May 06, 2020 at 12:07:12AM +0100, Edwin Török wrote:
> > 
> > On 5 May 2020 01:58:11 BST, "Darrick J. Wong" <
> > darrick.wong@...cle.com> wrote:
> > > On Mon, May 04, 2020 at 11:54:05PM +0100, Edwin Török wrote:
> > > > On Mon, 2020-05-04 at 08:21 -0700, Darrick J. Wong wrote:
> > > > > On Mon, Apr 27, 2020 at 10:15:57AM +0100, Edwin Török wrote:
> > > > > > On Tue, 2020-04-21 at 10:16 -0700, Darrick J. Wong wrote:
> > > > > > > On Sat, Apr 18, 2020 at 11:19:03AM +0100, Edwin Török
> > > > > > > wrote:
> > > > > > > > [1.] One line summary of the problem:
> > > > > > > > 
> > > > > > > > I 100% reproducibly get XFS in-memory data corruption
> > > > > > > > when
> > > > > > > > running
> > > > > > > > duperemove on an XFS filesystem with reflinks, even
> > > > > > > > after
> > > > > > > > running
> > > > > > > > xfs_repair and repeating the operation.
> > > > > > > > 
> > > > > > > > [2.] Full description of the problem/report:
> > > > > > > > Ubuntu bugreport here: 
> > > > > > > > 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1873555
> > > > > > > 
> > > > > > > Hmm.  I recently fixed an uninitialized variable error in
> > > > > > > xfs_reflink_remap_extent.
> > > > > > > 
> > > > > > > Does applying that patch to the ubuntu kernel (or running
> > > > > > > the
> > > > > > > same
> > > > > > > workload on the 5.7-rc2 ubuntu mainline kernel) fix this?
> > > > > > > 
> > > > > > > https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7-rc2/
> > > > > > 
> > > > > > [...]
> > > > > 
> > > > > A smaller testcase would help.
> > > > 
> > > > Found it! I don't think it is corruption at all, more like the
> > > function
> > > > caller not handling an expected error condition, see below:
> > > > 
> > > > I wasn't able to create a testcase that doesn't also need all
> > > > my
> > > data,
> > > > but I found a faster repro which takes only a few minutes:
> > > > 
> > > > 
> > > > sudo lvconvert --merge /dev/mapper/storage-storage--snapshot
> > > > sudo lvcreate -L 32G -s -n storage-snapshot
> > > /dev/mapper/storage-backup
> > > > sudo mount -o noatime /dev/mapper/storage-backup /mnt/storage
> > > > sudo duperemove -d --hashfile hashes /mnt/storage/from-
> > > > 
> > > restic/04683ed4/tmp/fast/c14-sd-92016-2-home/2017-06-07-
> > > 233355/2017-06-
> > > > 07-233355/edwin/.mu/xapian/postlist.DB /mnt/storage/from-
> > > > 
> > > restic/0fcf01c9/tmp/fast/c14-sd-92016-2-home/2017-05-23-
> > > 233346/2017-05-
> > > > 23-233346/edwin/.mu/xapian/postlist.DB /mnt/storage/from-
> > > > 
> > > restic/10350278/tmp/fast/c14-sd-92016-2-home/2017-06-17-
> > > 233341/2017-06-
> > > > 17-233341/edwin/.mu/xapian/postlist.DB
> > > > sudo umount /mnt/storage
> > > > 
> > > > 
> > > > > Or... if you have bpftrace handy, using kretprobes to figure
> > > > > out
> > > > > which
> > > > > function starts the -EIO return that ultimately causes the
> > > > > remap to
> > > > > fail.
> > > > > 
> > > > > (Or failing that, 'trace-cmd -e xfs_buf_ioerr' to see if it
> > > uncovers
> > > > > anything.)
> > > > 
> > > > With this xfs.bt:
> > > > 
> > > > kprobe:xfs_iread_extents,
> > > > kprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kprobe:xfs_bmap_del_extent_delay,
> > > > kprobe:xfs_bmap_del_extent_real,
> > > > kprobe:xfs_bmap_extents_to_btree,
> > > > kprobe:xfs_bmap_btree_to_extents,
> > > > kprobe:__xfs_bunmapi,
> > > > kprobe:xfs_defer_finish
> > > > {
> > > >     @st[tid] = kstack();
> > > >     @has[tid] = 1;
> > > > }
> > > > 
> > > > kretprobe:xfs_iread_extents,
> > > > kretprobe:xfs_bmap_add_extent_unwritten_real,
> > > > kretprobe:xfs_bmap_del_extent_delay,
> > > > kretprobe:xfs_bmap_del_extent_real,
> > > > kretprobe:xfs_bmap_extents_to_btree,
> > > > kretprobe:xfs_bmap_btree_to_extents,
> > > > kretprobe:__xfs_bunmapi,
> > > > kretprobe:xfs_defer_finish
> > > > /(retval!=0) && (@has[tid])/ {
> > > >     // kstack does not work here
> > > >     @errors[(int32)retval, probe, @st[tid]] = count();
> > > >     delete(@st[tid]);
> > > >     delete(@has[tid]);
> > > > }
> > > > 
> > > > BEGIN { printf("START\n"); }
> > > > END { print(@errors); }
> > > > 
> > > > I got this:
> > > > @errors[-28, kretprobe:xfs_bmap_del_extent_real, 
> > > >     xfs_bmap_del_extent_real+1
> > > >     kretprobe_trampoline+0
> > > >     xfs_reflink_remap_blocks+286
> > > >     xfs_file_remap_range+272
> > > >     vfs_dedupe_file_range_one+301
> > > >     vfs_dedupe_file_range+342
> > > >     do_vfs_ioctl+832
> > > >     ksys_ioctl+103
> > > >     __x64_sys_ioctl+26
> > > >     do_syscall_64+87
> > > >     entry_SYSCALL_64_after_hwframe+68
> > > > ]: 1
> > > > 
> > > > -28 is ENOSPC, but I have more than 1TiB free (and plenty of
> > > > free
> > > > inodes too).
> > > > 
> > > > Poking around in that code I found this block of code:
> > > >   >       /*
> > > >   >        * If it's the case where the directory code is
> > > > running
> > > with
> > > > no block
> > > >   >        * reservation, and the deleted block is in the
> > > > middle of
> > > its
> > > > extent,
> > > >   >        * and the resulting insert of an extent would cause
> > > > transformation to
> > > >   >        * btree format, then reject it.  The calling code
> > > > will
> > > then
> > > > swap blocks
> > > >   >        * around instead.  We have to do this now, rather
> > > > than
> > > > waiting for the
> > > >   >        * conversion to btree format, since the transaction
> > > > will
> > > be
> > > > dirty then.
> > > >   >        */
> > > >   >       if (tp->t_blk_res == 0 &&
> > > >   >           XFS_IFORK_FORMAT(ip, whichfork) ==
> > > XFS_DINODE_FMT_EXTENTS
> > > > &&
> > > >   >           XFS_IFORK_NEXTENTS(ip, whichfork) >=
> > > >   >       >       >       XFS_IFORK_MAXEXT(ip, whichfork) &&
> > > >   >           del->br_startoff > got.br_startoff && del_endoff
> > > > <
> > > > got_endoff) {
> > > >                   xfs_warn(mp, "returning ENOSPC in %s",
> > > > __func__);
> > > >   >       >       return -ENOSPC;
> > > >               }
> > > 
> > > Aww, gross, that's some weird thing that exists for directory
> > > code. :/
> > > 
> > > I bet the problem here is that the remap transaction is running
> > > out of
> > > block reservation.  Looking at xfs_reflink_remap_extent it looks
> > > like
> > > we
> > > reserve one extent mapping's worth of blocks for the whole
> > > operation.
> > > I don't think is correct, since we're removing a whole mapping
> > > and
> > > inserting another mapping, but only allowing for one bmbt split.
> > > 
> > > Hmm, if you find the chunk of xfs_reflink_remap_extent:
> > > 
> > > 	/* Start a rolling transaction to switch the mappings */
> > > 	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > > 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> > > 0, &tp);
> > > 
> > > and then change the resblks= line to read:
> > > 
> > > 	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK) *
> > > 2;
> > > 
> > > Does that make the crash go away?
> > 
> > No, the smallest multiplier that works for this file is 14, perhaps
> > the formula here needs some of the other space macros? I don't see
> > why
> > 14 would be the answer.  I'll let it run on the other files to see
> > if
> > that's enough in general.
> 
> Ok.  14 also confirms my suspicions that we're running out of block
> reservation, probably because of the length of the transaction chain.

FWIW running duperemove on the rest of the FS triggered the issue again
even with 14 on the exact same file.


> 
> > > Or, send me a metadump and let me bang my head on this tomorrow.
> > > :)
> > 
> > That'll be quite large, is there a way to metadump only a subdir?
> 
> Hm... could you send me a filefrag -v of the files that you're
> feeding
> to duperemove?  That might give me a better idea of the reflinking
> factor of the filesystem.

sudo filefrag -v /mnt/storage/from-restic/04683ed4/tmp/fast/c14-sd-
92016-2-home/2017-06-07-233355/2017-06-07-
233355/edwin/.mu/xapian/postlist.DB /mnt/storage/from-
restic/0fcf01c9/tmp/fast/c14-sd-92016-2-home/2017-05-23-233346/2017-05-
23-233346/edwin/.mu/xapian/postlist.DB /mnt/storage/from-
restic/10350278/tmp/fast/c14-sd-92016-2-home/2017-06-17-233341/2017-06-
17-233341/edwin/.mu/xapian/postlist.DB 
Filesystem type is: 58465342
File size of /mnt/storage/from-restic/04683ed4/tmp/fast/c14-sd-92016-2-
home/2017-06-07-233355/2017-06-07-233355/edwin/.mu/xapian/postlist.DB
is 2255716352 (550712 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected:
flags:
   0:        0..     650:  289205390..
289206040:    651:             shared
   1:      651..     827:  288864297.. 288864473:    177:  289206041:
shared
   2:      828..    1098:  289206041.. 289206311:    271:  288864474:
shared
   3:     1099..    1490:  288947797.. 288948188:    392:  289206312:
shared
   4:     1491..    1928:  289206312.. 289206749:    438:  288948189:
shared
   5:     1929..    3078:  289024844.. 289025993:   1150:  289206750:
shared
   6:     3079..    7504:  289206750.. 289211175:   4426:  289025994:
shared
   7:     7505..   10083:  289030420.. 289032998:   2579:  289211176:
shared
   8:    10084..   11930:  289211176.. 289213022:   1847:  289032999:
shared
   9:    11931..   13072:  289034846.. 289035987:   1142:  289213023:
shared
  10:    13073..   14183:  289213023.. 289214133:   1111:  289035988:
shared
  11:    14184..   15536:  289037099.. 289038451:   1353:  289214134:
shared
  12:    15537..   20897:  289219223.. 289224583:   5361:  289038452:
shared
  13:    20898..   23720:  289043813.. 289046635:   2823:  289224584:
shared
  14:    23721..   24004:  289214134.. 289214417:    284:  289046636:
shared
  15:    24005..   24784:  289046920.. 289047699:    780:  289214418:
shared
  16:    24785..   31903:  289224584.. 289231702:   7119:  289047700:
shared
  17:    31904..   34332:  289096507.. 289098935:   2429:  289231703:
shared
  18:    34333..   35396:  289214898.. 289215961:   1064:  289098936:
shared
  19:    35397..   45075:  289153956.. 289163634:   9679:  289215962:
shared
  20:    45076..  550711:  289336838.. 289842473: 505636:  289163635:
last,shared,eof
/mnt/storage/from-restic/04683ed4/tmp/fast/c14-sd-92016-2-home/2017-06-
07-233355/2017-06-07-233355/edwin/.mu/xapian/postlist.DB: 21 extents
found
File size of /mnt/storage/from-restic/0fcf01c9/tmp/fast/c14-sd-92016-2-
home/2017-05-23-233346/2017-05-23-233346/edwin/.mu/xapian/postlist.DB
is 2255716352 (550712 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected:
flags:
   0:        0..     650:  289205390..
289206040:    651:             shared
   1:      651..     827:  288864297.. 288864473:    177:  289206041:
shared
   2:      828..    1098:  289206041.. 289206311:    271:  288864474:
shared
   3:     1099..    1490:  288947797.. 288948188:    392:  289206312:
shared
   4:     1491..    1928:  289206312.. 289206749:    438:  288948189:
shared
   5:     1929..    3078:  289024844.. 289025993:   1150:  289206750:
shared
   6:     3079..    7504:  289206750.. 289211175:   4426:  289025994:
shared
   7:     7505..   10083:  289030420.. 289032998:   2579:  289211176:
shared
   8:    10084..   11930:  289211176.. 289213022:   1847:  289032999:
shared
   9:    11931..   13072:  289034846.. 289035987:   1142:  289213023:
shared
  10:    13073..   14183:  289213023.. 289214133:   1111:  289035988:
shared
  11:    14184..   15536:  289037099.. 289038451:   1353:  289214134:
shared
  12:    15537..   20897:  289219223.. 289224583:   5361:  289038452:
shared
  13:    20898..   23720:  289043813.. 289046635:   2823:  289224584:
shared
  14:    23721..   24004:  289214134.. 289214417:    284:  289046636:
shared
  15:    24005..   24784:  289046920.. 289047699:    780:  289214418:
shared
  16:    24785..   31903:  289224584.. 289231702:   7119:  289047700:
shared
  17:    31904..   34332:  289096507.. 289098935:   2429:  289231703:
shared
  18:    34333..   35396:  289214898.. 289215961:   1064:  289098936:
shared
  19:    35397..   45075:  289153956.. 289163634:   9679:  289215962:
shared
  20:    45076..  550711:  289336838.. 289842473: 505636:  289163635:
last,shared,eof
/mnt/storage/from-restic/0fcf01c9/tmp/fast/c14-sd-92016-2-home/2017-05-
23-233346/2017-05-23-233346/edwin/.mu/xapian/postlist.DB: 21 extents
found
File size of /mnt/storage/from-restic/10350278/tmp/fast/c14-sd-92016-2-
home/2017-06-17-233341/2017-06-17-233341/edwin/.mu/xapian/postlist.DB
is 2255716352 (550712 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected:
flags:
   0:        0..     650:  289205390..
289206040:    651:             shared
   1:      651..     827:  288864297.. 288864473:    177:  289206041:
shared
   2:      828..    1098:  289206041.. 289206311:    271:  288864474:
shared
   3:     1099..    1490:  288947797.. 288948188:    392:  289206312:
shared
   4:     1491..    1928:  289206312.. 289206749:    438:  288948189:
shared
   5:     1929..    3078:  289024844.. 289025993:   1150:  289206750:
shared
   6:     3079..    7504:  289206750.. 289211175:   4426:  289025994:
shared
   7:     7505..   10083:  289030420.. 289032998:   2579:  289211176:
shared
   8:    10084..   11930:  289211176.. 289213022:   1847:  289032999:
shared
   9:    11931..   13072:  289034846.. 289035987:   1142:  289213023:
shared
  10:    13073..   14183:  289213023.. 289214133:   1111:  289035988:
shared
  11:    14184..   15536:  289037099.. 289038451:   1353:  289214134:
shared
  12:    15537..   20897:  289219223.. 289224583:   5361:  289038452:
shared
  13:    20898..   23720:  289043813.. 289046635:   2823:  289224584:
shared
  14:    23721..   24004:  289214134.. 289214417:    284:  289046636:
shared
  15:    24005..   24784:  289046920.. 289047699:    780:  289214418:
shared
  16:    24785..   31903:  289224584.. 289231702:   7119:  289047700:
shared
  17:    31904..   34332:  289096507.. 289098935:   2429:  289231703:
shared
  18:    34333..   35396:  289214898.. 289215961:   1064:  289098936:
shared
  19:    35397..   45075:  289153956.. 289163634:   9679:  289215962:
shared
  20:    45076..  550711:  289336838.. 289842473: 505636:  289163635:
last,shared,eof
/mnt/storage/from-restic/10350278/tmp/fast/c14-sd-92016-2-home/2017-06-
17-233341/2017-06-17-233341/edwin/.mu/xapian/postlist.DB: 21 extents
found

Looking at duperemove's database there are some files with high reflink
counts:

sqlite> select hex(digest), count(ino) as c from hashes group by digest
order by c desc limit 30;
E47862EAC39A71995587EC6F98C0F77E|120826
603774B45A054235A3EA9973C0326B1A|1146
1155F3DE7E9760BD3774A084F651E734|1126
737B0085639F6182B8975B253673CC0F|1126
1E49066D483DC85F597AA1E311B759C8|852
25F03E77F28B29812EF7BC467E2AB8A3|720
3D1DA8B943984CAA28D8143E57CF2AD2|720
49EDE0A7B5BF869DDF0EC58D97B88CA6|720
514E20541D0F06B15042E7F60D656F6E|720
9D05E927C2CBFF1B95B4CD423CB82683|720
A5630A5CCA0D495679BCDB0E9B118459|720
EE94AF6B4F3E1B53781AC5202297D4CF|720
66952AB2F45AD72E5E4F7F5753C45A42|702
3028010B6254DA58C5A7A83FD57EA5D8|612
8A35D931C2A52EC0BAE0F8499FE9D56B|603
167F01605D23CB6DDA52FD3036CB9925|593
5BC8DAF52583CA428096C2B5046B2329|593
C1A98787EF8CF318311AE2FAECB526EC|593
ABDC9B93518EFB347F6FBD6FE7D2F8FE|421
4DAA7ECCAA161C148624495563211F81|395
4E08DB53E7E602136CB0E5A8886761C7|395
5DFDE0C5AA5CA52429389FF516B69012|395
72E52A135D6DF5C931E3660FA30F5856|395
7D7F1D1345C7E22E211B9253F2A0DC34|395
87B61864CDF325C831B6B2B6B99B9217|395
A191E341458429624A534414547251FD|395
A932C7FD8269F6941A84F9F4DC573E59|395
B0B52DE7472FC02EB92836B875A92A05|395
A84CFAD1E5BCD8CC8DA1FBFEB8BAD520|363
1304E6AF0087B5498184C9532B79207F|358

> 
> (Obviously, a full metadump would be useful for confirming the shape
> of
> the refcount btree, but...first things first let's look at the
> filefrag
> output.)

I'll try to gather one, and find a place to store/share it.

Best regards,
--Edwin

> 
> --D
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ