[ Upstream commit b8883b61f2 ]
btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held,
as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was
calling it after do_reclaim_sweep() returns, at which point
space_info->lock is no longer held.
Fix this by explicitly acquiring space_info->lock before clearing the
periodic reclaim ready flag in btrfs_reclaim_sweep().
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/
Fixes: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 968f19c5b1 upstream.
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode 'none').
[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).
This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.
But the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
btrfs requires no modification, this leads to the false csum mismatch.
This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.
For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.
[FIX]
I have considered the following ideas to solve the problem:
- Make direct IO to always skip data checksum
This not only requires a new incompatible flag, as it breaks the
current per-inode NODATASUM flag.
But also requires extra handling for no csum found cases.
And this also reduces our checksum protection.
- Let hardware handle all the checksum
AKA, just nodatasum mount option.
That requires trust for hardware (which is not that trustful in a lot
of cases), and it's not generic at all.
- Always fallback to buffered write if the inode requires checksum
This was suggested by Christoph, and is the solution utilized by this
patch.
The cost is obvious, the extra buffer copying into page cache, thus it
reduces the performance.
But at least it's still user configurable, if the end user still wants
the zero-copy performance, just set NODATASUM flag for the inode
(which is a common practice for VM images on btrfs).
Since we cannot trust user space programs to keep the buffer
consistent during direct IO, we have no choice but always falling back
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 52ee9965d0 ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d00cbce0a7 ]
These are two simple macros which ensure that a pointer is initialized
to NULL and with the proper cleanup attribute for it.
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 52ee9965d0 ("btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6a1ab50135 ]
The stripe offset calculation in the zoned code for raid0 and raid10
wrongly uses map->stripe_size to calculate it. In fact, map->stripe_size is
the size of the device extent composing the block group, which always is
the zone_size on the zoned setup.
Fix it by using BTRFS_STRIPE_LEN and BTRFS_STRIPE_LEN_SHIFT. Also, optimize
the calculation a bit by doing the common calculation only once.
Fixes: c0d90a79e8 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.17+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 52ee9965d0 ("btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e2d848649e ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: c0d90a79e8 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit dda3ec9ee6 ]
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
Then, that will cause an error due to mismatch of the write pointers.
We can fixup this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: 568220fa96 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c0d90a79e8 ]
When one of two zones composing a DUP block group is a conventional zone,
we have the zone_info[i]->alloc_offset = WP_CONVENTIONAL. That will, of
course, not match the write pointer of the other zone, and fails that
block group.
This commit solves that issue by properly recovering the emulated write
pointer from the last allocated extent. The offset for the SINGLE, DUP,
and RAID1 are straight-forward: it is same as the end of last allocated
extent. The RAID0 and RAID10 are a bit tricky that we need to do the math
of striping.
This is the kernel equivalent of Naohiro's user-space commit:
"btrfs-progs: zoned: fix alloc_offset calculation for partly
conventional block groups".
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: dda3ec9ee6 ("btrfs: zoned: fixup last alloc pointer after extent removal for RAID1")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 19eff93dc7 ]
Problems with current implementation:
1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
negative reclaimable_bytes to trigger reclaim unexpectedly
2. The "space must be freed between scans" assumption breaks the
two-scan requirement: first scan marks block groups, second scan
reclaims them. Without the second scan, no reclamation occurs.
Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made. And resume
periodic reclaim when there's enough free space.
And we take care if reclaim is making any progress now, so it's
unnecessary to set periodic_reclaim_ready to false when failed to reclaim
a block group.
Fixes: 813d4c6422 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 6207687043 ]
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_data_ref()
alloc_reserved_file_extent()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
For metadata extents:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_tree_ref()
alloc_reserved_tree_block()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().
So fix this by taking the "reserved" counter into consideration.
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ba5d06440c ]
At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter
of the block group while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().
So avoid these races by grabbing the counter in a critical section
delimited by the block group's spinlock after setting the block group to
RO mode. This also avoids using two different values of the counter in
case it changes in between each read. This silences KCSAN and is required
for the next patch in the series too.
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 343a63594b ]
The parameter is unused and we can get it from space info if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 587bb33b10 ]
Commit d7f67ac9a9 ("btrfs: relax block-group-tree feature dependency
checks") introduced a regression when it comes to handling unsupported
incompat or compat_ro flags. Beforehand we only printed the flags that
we didn't recognize, afterwards we printed them all, which is less
useful. Fix the error handling so it behaves like it used to.
Fixes: d7f67ac9a9 ("btrfs: relax block-group-tree feature dependency checks")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1c7e9111f4 ]
Fix the error message in btrfs_delete_subvolume() if we can't delete a
subvolume because it has an active swapfile: we were printing the number
of the parent rather than the target.
Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 44e2fda664 ]
Commit b471965fdb ("btrfs: fix replace/scrub failure with
metadata_uuid") fixed the comparison in scrub_verify_one_metadata() to
use metadata_uuid rather than fsid, but left the warning as it was. Fix
it so it matches what we're doing.
Fixes: b471965fdb ("btrfs: fix replace/scrub failure with metadata_uuid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a101727805 ]
Fix a copy-paste error in check_extent_data_ref(): we're printing root
as in the message above, we should be printing objectid.
Fixes: f333a3c7e8 ("btrfs: tree-checker: validate dref root and objectid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 511dc8912a ]
Fix the error message in check_dev_extent_item(), when an overlapping
stripe is encountered. For dev extents, objectid is the disk number and
offset the physical address, so prev_key->objectid should actually be
prev_key->offset.
(I can't take any credit for this one - this was discovered by Chris and
his friend Claude.)
Reported-by: Chris Mason <clm@fb.com>
Fixes: 008e2512dc ("btrfs: tree-checker: add dev extent item checks")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 912d1c6680 ]
Commit 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.
Fix this by replacing 'break' with 'continue'.
Fixes: 93bba24d4b ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit be6324a809 ]
We search with offset (u64)-1 which should never match exactly.
Previously this was handled with BUG(). Now logs an error
and return -EUCLEAN.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Adarsh Das <adarshdas950@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bfb670b918 ]
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.
Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7c2830f00c ]
[BACKGROUND]
Inspired by a recent kernel bug report, which is related to direct IO
buffer modification during writeback, that leads to contents mismatch of
different RAID1 mirrors.
[CAUSE AND PROBLEMS]
The root cause is exactly the same explained in commit 968f19c5b1
("btrfs: always fallback to buffered write if the inode requires
checksum"), that we can not trust direct IO buffer which can be modified
halfway during writeback.
Unlike data checksum verification, if this happened on inodes without
data checksum but has the data has extra mirrors, it will lead to
stealth data mismatch on different mirrors.
This will be way harder to detect without data checksum.
Furthermore for RAID56, we can even have data without checksum and data
with checksum mixed inside the same full stripe.
In that case if the direct IO buffer got changed halfway for the
nodatasum part, the data with checksum immediately lost its ability to
recover, e.g.:
" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback
0 32K 64K
Data 1 | | Has csum
Data 2 |XXXXXXXXXXXXXXXX | No csum
Parity | |
In above case, the parity is calculated using data 1 (has csum, from
page cache, won't change during writeback), and old data 2 (has no csum,
direct IO write).
After parity is calculated, but before submission to the storage, direct
IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
has a different content.
Now all data is submitted to the storage, and the fs got fully synced.
Then the device of data 1 is lost, has to be rebuilt from data 2 and
parity. But since the data 2 has some modified data, and the parity is
calculated using old data, the recovered data is no the same for data 1,
causing data checksum mismatch.
[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO and the end user is fully responsible for any
race.
However this is not going to fix all situations, as it's still possible
to race with balance where the fs got a new data profile after the data
allocation profile check.
But this fix should still greatly reduce the window of the original bug.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ecb7c2484c ]
If btrfs_search_slot_for_read() returns 1, it means we did not find any
key greater than or equals to the key we asked for, meaning we have
reached the end of the tree and therefore the path is not valid. If
this happens we need to break out of the loop and stop, instead of
continuing and accessing an invalid path.
Fixes: 5223cc60b4 ("btrfs: drop the path before adding qgroup items when enabling qgroups")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2155d0c0a7 ]
When initializing the delayed refs block reserve for a transaction handle
we are passing a type of BTRFS_BLOCK_RSV_DELOPS, which is meant for
delayed items and not for delayed refs. The correct type for delayed refs
is BTRFS_BLOCK_RSV_DELREFS.
On release of any excess space reserved in a local delayed refs reserve,
we also should transfer that excess space to the global block reserve
(it it's full, we return to the space info for general availability).
By initializing a transaction's local delayed refs block reserve with a
type of BTRFS_BLOCK_RSV_DELOPS, we were also causing any excess space
released from the delayed block reserve (fs_info->delayed_block_rsv, used
for delayed inodes and items) to be transferred to the global block
reserve instead of the global delayed refs block reserve. This was an
unintentional change in commit 28270e25c6 ("btrfs: always reserve space
for delayed refs when starting transaction"), but it's not particularly
serious as things tend to cancel out each other most of the time and it's
relatively rare to be anywhere near exhaustion of the global reserve.
Fix this by initializing a transaction's local delayed refs reserve with
a type of BTRFS_BLOCK_RSV_DELREFS and making btrfs_block_rsv_release()
attempt to transfer unused space from such a reserve into the global block
reserve, just as we did before that commit for when the block reserve is
a delayed refs rsv.
Reported-by: Alex Lyakas <alex.lyakas@zadara.com>
Link: https://lore.kernel.org/linux-btrfs/CAOcd+r0FHG5LWzTSu=LknwSoqxfw+C00gFAW7fuX71+Z5AfEew@mail.gmail.com/
Fixes: 28270e25c6 ("btrfs: always reserve space for delayed refs when starting transaction")
Reviewed-by: Alex Lyakas <alex.lyakas@zadara.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3a1f4264da ]
When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the
block group tree to the switch_commits list before calling
switch_commit_roots, as we do for the tree root and the chunk root.
However, the block group tree uses normal root dirty tracking and in any
transaction that does an allocation and dirties a block group, the block
group root will already be linked to a list by the dirty_list field and
this use of list_add_tail() is invalid and corrupts the prev/next
members of block_group_root->dirty_list.
This is apparent on a subsequent list_del on the prev if we enable
CONFIG_DEBUG_LIST:
[32.1571] ------------[ cut here ]------------
[32.1572] list_del corruption. next->prev should beffff958890202538, but was ffff9588992bd538. (next=ffff958890201538)
[32.1575] WARNING: lib/list_debug.c:65 at 0x0, CPU#3: sync/607
[32.1583] CPU: 3 UID: 0 PID: 607 Comm: sync Not tainted 6.18.0 #24PREEMPT(none)
[32.1585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS1.17.0-4.fc41 04/01/2014
[32.1587] RIP: 0010:__list_del_entry_valid_or_report+0x108/0x120
[32.1593] RSP: 0018:ffffaa288287fdd0 EFLAGS: 00010202
[32.1594] RAX: 0000000000000001 RBX: ffff95889326e800 RCX:ffff958890201538
[32.1596] RDX: ffff9588992bd538 RSI: ffff958890202538 RDI:ffffffff82a41e00
[32.1597] RBP: ffff958890202538 R08: ffffffff828fc1e8 R09:00000000ffffefff
[32.1599] R10: ffffffff8288c200 R11: ffffffff828e4200 R12:ffff958890201538
[32.1601] R13: ffff95889326e958 R14: ffff958895c24000 R15:ffff958890202538
[32.1603] FS: 00007f0c28eb5740(0000) GS:ffff958af2bd2000(0000)knlGS:0000000000000000
[32.1605] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32.1607] CR2: 00007f0c28e8a3cc CR3: 0000000109942005 CR4:0000000000370ef0
[32.1609] Call Trace:
[32.1610] <TASK>
[32.1611] switch_commit_roots+0x82/0x1d0 [btrfs]
[32.1615] btrfs_commit_transaction+0x968/0x1550 [btrfs]
[32.1618] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
[32.1621] __iterate_supers+0xe8/0x190
[32.1622] ? __pfx_sync_fs_one_sb+0x10/0x10
[32.1623] ksys_sync+0x63/0xb0
[32.1624] __do_sys_sync+0xe/0x20
[32.1625] do_syscall_64+0x73/0x450
[32.1626] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[32.1627] RIP: 0033:0x7f0c28d05d2b
[32.1632] RSP: 002b:00007ffc9d988048 EFLAGS: 00000246 ORIG_RAX:00000000000000a2
[32.1634] RAX: ffffffffffffffda RBX: 00007ffc9d988228 RCX:00007f0c28d05d2b
[32.1636] RDX: 00007f0c28e02301 RSI: 00007ffc9d989b21 RDI:00007f0c28dba90d
[32.1637] RBP: 0000000000000001 R08: 0000000000000001 R09:0000000000000000
[32.1639] R10: 0000000000000000 R11: 0000000000000246 R12:000055b96572cb80
[32.1641] R13: 000055b96572b19f R14: 00007f0c28dfa434 R15:000055b96572b034
[32.1643] </TASK>
[32.1644] irq event stamp: 0
[32.1644] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[32.1646] hardirqs last disabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
[32.1648] softirqs last enabled at (0): [<ffffffff81298817>]copy_process+0xb37/0x2260
[32.1650] softirqs last disabled at (0): [<0000000000000000>] 0x0
[32.1652] ---[ end trace 0000000000000000 ]---
Furthermore, this list corruption eventually (when we happen to add a
new block group) results in getting the switch_commits and
dirty_cowonly_roots lists mixed up and attempting to call update_root
on the tree root which can't be found in the tree root, resulting in a
transaction abort:
[87.8269] BTRFS critical (device nvme1n1): unable to find root key (1 0 0) in tree 1
[87.8272] ------------[ cut here ]------------
[87.8274] BTRFS: Transaction aborted (error -117)
[87.8275] WARNING: fs/btrfs/root-tree.c:153 at 0x0, CPU#4: sync/703
[87.8285] CPU: 4 UID: 0 PID: 703 Comm: sync Not tainted 6.18.0 #25 PREEMPT(none)
[87.8287] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-4.fc41 04/01/2014
[87.8289] RIP: 0010:btrfs_update_root+0x296/0x790 [btrfs]
[87.8295] RSP: 0018:ffffa58d035dfd60 EFLAGS: 00010282
[87.8297] RAX: ffff9a59126ddb68 RBX: ffff9a59126dc000 RCX: 0000000000000000
[87.8299] RDX: 0000000000000000 RSI: 00000000ffffff8b RDI: ffffffffc0b28270
[87.8301] RBP: ffff9a5904aec000 R08: 0000000000000000 R09: 00000000ffffefff
[87.8303] R10: ffffffff9ac8c200 R11: ffffffff9ace4200 R12: 0000000000000001
[87.8305] R13: ffff9a59041740e8 R14: ffff9a5904aec1f7 R15: ffff9a590fdefaf0
[87.8307] FS: 00007f54cde6b740(0000) GS:ffff9a5b5a81c000(0000) knlGS:0000000000000000
[87.8309] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87.8310] CR2: 00007f54cde403cc CR3: 0000000112902004 CR4: 0000000000370ef0
[87.8312] Call Trace:
[87.8313] <TASK>
[87.8314] ? _raw_spin_unlock+0x23/0x40
[87.8315] commit_cowonly_roots+0x1ad/0x250 [btrfs]
[87.8317] ? btrfs_commit_transaction+0x79b/0x1560 [btrfs]
[87.8320] btrfs_commit_transaction+0x8aa/0x1560 [btrfs]
[87.8322] ? btrfs_attach_transaction_barrier+0x23/0x60 [btrfs]
[87.8325] __iterate_supers+0xf1/0x170
[87.8326] ? __pfx_sync_fs_one_sb+0x10/0x10
[87.8327] ksys_sync+0x63/0xb0
[87.8328] __do_sys_sync+0xe/0x20
[87.8329] do_syscall_64+0x73/0x450
[87.8330] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[87.8331] RIP: 0033:0x7f54cdd05d2b
[87.8336] RSP: 002b:00007fff1b58ff78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[87.8338] RAX: ffffffffffffffda RBX: 00007fff1b590158 RCX: 00007f54cdd05d2b
[87.8340] RDX: 00007f54cde02301 RSI: 00007fff1b592b66 RDI: 00007f54cddba90d
[87.8342] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
[87.8344] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e07ca96b80
[87.8346] R13: 000055e07ca9519f R14: 00007f54cddfa434 R15: 000055e07ca95034
[87.8348] </TASK>
[87.8348] irq event stamp: 0
[87.8349] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[87.8351] hardirqs last disabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
[87.8353] softirqs last enabled at (0): [<ffffffff99698797>] copy_process+0xb37/0x21e0
[87.8355] softirqs last disabled at (0): [<0000000000000000>] 0x0
[87.8357] ---[ end trace 0000000000000000 ]---
[87.8358] BTRFS: error (device nvme1n1 state A) in btrfs_update_root:153: errno=-117 Filesystem corrupted
[87.8360] BTRFS info (device nvme1n1 state EA): forced readonly
[87.8362] BTRFS warning (device nvme1n1 state EA): Skipping commit of aborted transaction.
[87.8364] BTRFS: error (device nvme1n1 state EA) in cleanup_transaction:2037: errno=-117 Filesystem corrupted
Since the block group tree was pulled out of the extent tree and uses
normal root dirty tracking, remove the offending extra list_add. This
fixes the list corruption and the resulting fs corruption.
Fixes: 14033b08a0 ("btrfs: don't save block group root into super block")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 51b1fcf71c ]
If we fail to delete the second qgroup relation item, we end up returning
success or -ENOENT in case the first item does not exist, instead of
returning the error from the second item deletion.
Fixes: 73798c465b ("btrfs: qgroup: Try our best to delete qgroup relations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1972f44c18 ]
[BUG]
There is a bug report where a heavily fuzzed fs is mounted with all
rescue mount options, which leads to the following warnings during
unmount:
BTRFS: Transaction aborted (error -22)
Modules linked in:
CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted
6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline]
RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611
Call Trace:
<TASK>
btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705
btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157
btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517
btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708
btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130
btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499
btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628
evict+0x5f4/0xae0 fs/inode.c:837
__dentry_kill+0x209/0x660 fs/dcache.c:670
finish_dput+0xc9/0x480 fs/dcache.c:879
shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661
generic_shutdown_super+0x67/0x2c0 fs/super.c:621
kill_anon_super+0x3b/0x70 fs/super.c:1289
btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127
deactivate_locked_super+0xbc/0x130 fs/super.c:474
cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318
task_work_run+0x1d4/0x260 kernel/task_work.c:233
exit_task_work include/linux/task_work.h:40 [inline]
do_exit+0x694/0x22f0 kernel/exit.c:971
do_group_exit+0x21c/0x2d0 kernel/exit.c:1112
__do_sys_exit_group kernel/exit.c:1123 [inline]
__se_sys_exit_group kernel/exit.c:1121 [inline]
__x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121
x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x44f639
Code: Unable to access opcode bytes at 0x44f60f.
RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0
R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
</TASK>
Since rescue mount options will mark the full fs read-only, there should
be no new transaction triggered.
But during unmount we will evict all inodes, which can trigger a new
transaction, and triggers warnings on a heavily corrupted fs.
[CAUSE]
Btrfs allows new transaction even on a read-only fs, this is to allow
log replay happen even on read-only mounts, just like what ext4/xfs do.
However with rescue mount options, the fs is fully read-only and cannot
be remounted read-write, thus in that case we should also reject any new
transactions.
[FIX]
If we find the fs has rescue mount options, we should treat the fs as
error, so that no new transaction can be started.
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c1c050f92d ]
If we fail to allocate a path or join a transaction, we return from
__cow_file_range_inline() without freeing the reserved qgroup data,
resulting in a leak. Fix this by ensuring we call btrfs_qgroup_free_data()
in such cases.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
This is a stable-only patch. The issue was inadvertently fixed in 6.17 [0]
as part of a refactoring, but this patch serves as a minimal targeted fix
for prior kernels.
Users of filemap_lock_folio() need to guard against the situation where
release_folio() has been invoked during reclaim but the folio was
ultimately not removed from the page cache. This patch covers one location
that was overlooked.
After acquiring the folio, use set_folio_extent_mapped() to ensure the
folio private state is valid. This is especially important in the subpage
case, where the private field is an allocated struct containing bitmap and
lock data.
Without this protection, the race below is possible:
[mm] page cache reclaim path [fs] relocation in subpage mode
shrink_folio_list()
folio_trylock() /* lock acquired */
filemap_release_folio()
mapping->a_ops->release_folio()
btrfs_release_folio()
__btrfs_release_folio()
clear_folio_extent_mapped()
btrfs_detach_subpage()
subpage = folio_detach_private(folio)
btrfs_free_subpage(subpage)
kfree(subpage) /* point A */
prealloc_file_extent_cluster()
filemap_lock_folio()
folio_try_get() /* inc refcount */
folio_lock() /* wait for lock */
if (...)
...
else if (!mapping || !__remove_mapping(..))
/*
* __remove_mapping() returns zero when
* folio_ref_freeze(folio, refcount) fails /* point B */
*/
goto keep_locked /* folio remains in cache */
keep_locked:
folio_unlock(folio) /* lock released */
/* lock acquired */
btrfs_subpage_clear_uptodate()
/* use-after-free */
subpage = folio_get_private(folio)
[0] 4e346baee9 ("btrfs: reloc: unconditionally invalidate the page cache for each cluster")
Fixes: 9d9ea1e68a ("btrfs: subpage: fix relocation potentially overwriting last page data")
Cc: stable@vger.kernel.org # 6.10-6.16
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 38e818718c ]
>From the memory-barriers.txt document regarding memory barrier ordering
guarantees:
(*) These guarantees do not apply to bitfields, because compilers often
generate code to modify these using non-atomic read-modify-write
sequences. Do not attempt to use bitfields to synchronize parallel
algorithms.
(*) Even in cases where bitfields are protected by locks, all fields
in a given bitfield must be protected by one lock. If two fields
in a given bitfield are protected by different locks, the compiler's
non-atomic read-modify-write sequences can cause an update to one
field to corrupt the value of an adjacent field.
btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:
struct btrfs_space_info {
struct btrfs_fs_info * fs_info; /* 0 8 */
struct btrfs_space_info * parent; /* 8 8 */
...
int clamp; /* 172 4 */
unsigned int full:1; /* 176: 0 4 */
unsigned int chunk_alloc:1; /* 176: 1 4 */
unsigned int flush:1; /* 176: 2 4 */
...
Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.
Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():
T1 T2
btrfs_commit_transaction
btrfs_clear_space_info_full
data_sinfo->full = 0
READ: full:0, chunk_alloc:0, flush:1
do_async_reclaim_data_space(data_sinfo)
spin_lock(&space_info->lock);
if(list_empty(tickets))
space_info->flush = 0;
READ: full: 0, chunk_alloc:0, flush:1
MOD/WRITE: full: 0, chunk_alloc:0, flush:0
spin_unlock(&space_info->lock);
return;
MOD/WRITE: full:0, chunk_alloc:0, flush:1
and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.
I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
andb $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
andb $0xfe,-0x20(%rax)
So I think this is really a bug on practical systems. I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.
Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
writes to space_info->flush, regardless of locking.
Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[ The context change is due to the commit cc0517fe77
("btrfs: tweak extent/chunk allocation for space_info sub-space")
in v6.16 which is irrelevant to the logic of this patch. ]
Signed-off-by: Rahul Sharma <black.hawk@163.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1d8f69f453 ]
When the BLOCK_GROUP_TREE compat_ro flag is set, the extent root and
csum root fields are getting missed.
This is because EXTENT_TREE_V2 treated these differently, and when
they were split off this special-casing was mistakenly assigned to
BGT rather than the rump EXTENT_TREE_V2. There's no reason why the
existence of the block group tree should mean that we don't record the
details of the last commit's extent root and csum root.
Fix the code in backup_super_roots() so that the correct check gets
made.
Fixes: 1c56ab9919 ("btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 5037b34282 upstream.
When wait_current_trans() is called during start_transaction(), it
currently waits for a blocked transaction without considering whether
the given transaction type actually needs to wait for that particular
transaction state. The btrfs_blocked_trans_types[] array already defines
which transaction types should wait for which transaction states, but
this check was missing in wait_current_trans().
This can lead to a deadlock scenario involving two transactions and
pending ordered extents:
1. Transaction A is in TRANS_STATE_COMMIT_DOING state
2. A worker processing an ordered extent calls start_transaction()
with TRANS_JOIN
3. join_transaction() returns -EBUSY because Transaction A is in
TRANS_STATE_COMMIT_DOING
4. Transaction A moves to TRANS_STATE_UNBLOCKED and completes
5. A new Transaction B is created (TRANS_STATE_RUNNING)
6. The ordered extent from step 2 is added to Transaction B's
pending ordered extents
7. Transaction B immediately starts commit by another task and
enters TRANS_STATE_COMMIT_START
8. The worker finally reaches wait_current_trans(), sees Transaction B
in TRANS_STATE_COMMIT_START (a blocked state), and waits
unconditionally
9. However, TRANS_JOIN should NOT wait for TRANS_STATE_COMMIT_START
according to btrfs_blocked_trans_types[]
10. Transaction B is waiting for pending ordered extents to complete
11. Deadlock: Transaction B waits for ordered extent, ordered extent
waits for Transaction B
This can be illustrated by the following call stacks:
CPU0 CPU1
btrfs_finish_ordered_io()
start_transaction(TRANS_JOIN)
join_transaction()
# -EBUSY (Transaction A is
# TRANS_STATE_COMMIT_DOING)
# Transaction A completes
# Transaction B created
# ordered extent added to
# Transaction B's pending list
btrfs_commit_transaction()
# Transaction B enters
# TRANS_STATE_COMMIT_START
# waiting for pending ordered
# extents
wait_current_trans()
# waits for Transaction B
# (should not wait!)
Task bstore_kv_sync in btrfs_commit_transaction waiting for ordered
extents:
__schedule+0x2e7/0x8a0
schedule+0x64/0xe0
btrfs_commit_transaction+0xbf7/0xda0 [btrfs]
btrfs_sync_file+0x342/0x4d0 [btrfs]
__x64_sys_fdatasync+0x4b/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Task kworker in wait_current_trans waiting for transaction commit:
Workqueue: btrfs-syno_nocow btrfs_work_helper [btrfs]
__schedule+0x2e7/0x8a0
schedule+0x64/0xe0
wait_current_trans+0xb0/0x110 [btrfs]
start_transaction+0x346/0x5b0 [btrfs]
btrfs_finish_ordered_io.isra.0+0x49b/0x9c0 [btrfs]
btrfs_work_helper+0xe8/0x350 [btrfs]
process_one_work+0x1d3/0x3c0
worker_thread+0x4d/0x3e0
kthread+0x12d/0x150
ret_from_fork+0x1f/0x30
Fix this by passing the transaction type to wait_current_trans() and
checking btrfs_blocked_trans_types[cur_trans->state] against the given
type before deciding to wait. This ensures that transaction types which
are allowed to join during certain blocked states will not unnecessarily
wait and cause deadlocks.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cc: Motiejus Jakštys <motiejus@jakstys.lt>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a11224a016 ]
In create_space_info(), the 'space_info' object is allocated at the
beginning of the function. However, there are two error paths where the
function returns an error code without freeing the allocated memory:
1. When create_space_info_sub_group() fails in zoned mode.
2. When btrfs_sysfs_add_space_info_type() fails.
In both cases, 'space_info' has not yet been added to the
fs_info->space_info list, resulting in a memory leak. Fix this by
adding an error handling label to kfree(space_info) before returning.
Fixes: 2be12ef79f ("btrfs: Separate space_info create/update")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f92ee31e03 ]
Current code assumes we have only one space_info for each block group type
(DATA, METADATA, and SYSTEM). We sometime need multiple space infos to
manage special block groups.
One example is handling the data relocation block group for the zoned mode.
That block group is dedicated for writing relocated data and we cannot
allocate any regular extent from that block group, which is implemented in
the zoned extent allocator. This block group still belongs to the normal
data space_info. So, when all the normal data block groups are full and
there is some free space in the dedicated block group, the space_info
looks to have some free space, while it cannot allocate normal extent
anymore. That results in a strange ENOSPC error. We need to have a
space_info for the relocation data block group to represent the situation
properly.
Adds a basic infrastructure for having a "sub-group" of a space_info:
creation and removing. A sub-group space_info belongs to one of the
primary space_infos and has the same flags as its parent.
This commit first introduces the relocation data sub-space_info, and the
next commit will introduce tree-log sub-space_info. In the future, it could
be useful to implement tiered storage for btrfs e.g. by implementing a
sub-group space_info for block groups resides on a fast storage.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: a11224a016 ("btrfs: fix memory leaks in create_space_info() error paths")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1cfdbe0d53 ]
Factor out check_removing_space_info() from btrfs_free_block_groups(). It
sanity checks a to-be-removed space_info. There is no functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: a11224a016 ("btrfs: fix memory leaks in create_space_info() error paths")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ac5578fef3 ]
Factor out initialization of the space_info struct, which is used in a
later patch. There is no functional change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: a11224a016 ("btrfs: fix memory leaks in create_space_info() error paths")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 08b096c137 ]
Before accessing the disk_bytenr field of a file extent item we need
to check if we are dealing with an inline extent.
This is because for inline extents their data starts at the offset of
the disk_bytenr field. So accessing the disk_bytenr
means we are accessing inline data or in case the inline data is less
than 8 bytes we can actually cause an invalid
memory access if this inline extent item is the first item in the leaf
or access metadata from other items.
Fixes: 82bfb2e7b6 ("Btrfs: incremental send, fix unnecessary hole writes for sparse files")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e9e3b22ddf ]
[BUG]
For the following write sequence with 64K page size and 4K fs block size,
it will lead to file extent items to be inserted without any data
checksum:
mkfs.btrfs -s 4k -f $dev > /dev/null
mount $dev $mnt
xfs_io -f -c "pwrite 0 16k" -c "pwrite 32k 4k" -c pwrite "60k 64K" \
-c "truncate 16k" $mnt/foobar
umount $mnt
This will result the following 2 file extent items to be inserted (extra
trace point added to insert_ordered_extent_file_extent()):
btrfs_finish_one_ordered: root=5 ino=257 file_off=61440 num_bytes=4096 csum_bytes=0
btrfs_finish_one_ordered: root=5 ino=257 file_off=0 num_bytes=16384 csum_bytes=16384
Note for file offset 60K, we're inserting a file extent without any
data checksum.
Also note that range [32K, 36K) didn't reach
insert_ordered_extent_file_extent(), which is the correct behavior as
that OE is fully truncated, should not result any file extent.
Although file extent at 60K will be later dropped by btrfs_truncate(),
if the transaction got committed after file extent inserted but before
the file extent dropping, we will have a small window where we have a
file extent beyond EOF and without any data checksum.
That will cause "btrfs check" to report error.
[CAUSE]
The sequence happens like this:
- Buffered write dirtied the page cache and updated isize
Now the inode size is 64K, with the following page cache layout:
0 16K 32K 48K 64K
|/////////////| |//| |//|
- Truncate the inode to 16K
Which will trigger writeback through:
btrfs_setsize()
|- truncate_setsize()
| Now the inode size is set to 16K
|
|- btrfs_truncate()
|- btrfs_wait_ordered_range() for [16K, u64(-1)]
|- btrfs_fdatawrite_range() for [16K, u64(-1)}
|- extent_writepage() for folio 0
|- writepage_delalloc()
| Generated OE for [0, 16K), [32K, 36K] and [60K, 64K)
|
|- extent_writepage_io()
Then inside extent_writepage_io(), the dirty fs blocks are handled
differently:
- Submit write for range [0, 16K)
As they are still inside the inode size (16K).
- Mark OE [32K, 36K) as truncated
Since we only call btrfs_lookup_first_ordered_range() once, which
returned the first OE after file offset 16K.
- Mark all OEs inside range [16K, 64K) as finished
Which will mark OE ranges [32K, 36K) and [60K, 64K) as finished.
For OE [32K, 36K) since it's already marked as truncated, and its
truncated length is 0, no file extent will be inserted.
For OE [60K, 64K) it has never been submitted thus has no data
checksum, and we insert the file extent as usual.
This is the root cause of file extent at 60K to be inserted without
any data checksum.
- Clear dirty flags for range [16K, 64K)
It is the function btrfs_folio_clear_dirty() which searches and clears
any dirty blocks inside that range.
[FIX]
The bug itself was introduced a long time ago, way before subpage and
large folio support.
At that time, fs block size must match page size, thus the range
[cur, end) is just one fs block.
But later with subpage and large folios, the same range [cur, end)
can have multiple blocks and ordered extents.
Later commit 18de34daa7 ("btrfs: truncate ordered extent when skipping
writeback past i_size") was fixing a bug related to subpage/large
folios, but it's still utilizing the old range [cur, end), meaning only
the first OE will be marked as truncated.
The proper fix here is to make EOF handling block-by-block, not trying
to handle the whole range to @end.
By this we always locate and truncate the OE for every dirty block.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 18de34daa7 ]
While running test case btrfs/192 from fstests with support for large
folios (needs CONFIG_BTRFS_EXPERIMENTAL=y) I ended up getting very sporadic
btrfs check failures reporting that csum items were missing. Looking into
the issue it turned out that btrfs check searches for csum items of a file
extent item with a range that spans beyond the i_size of a file and we
don't have any, because the kernel's writeback code skips submitting bios
for ranges beyond eof. It's not expected however to find a file extent item
that crosses the rounded up (by the sector size) i_size value, but there is
a short time window where we can end up with a transaction commit leaving
this small inconsistency between the i_size and the last file extent item.
Example btrfs check output when this happens:
$ btrfs check /dev/sdc
Opening filesystem to check...
Checking filesystem on /dev/sdc
UUID: 69642c61-5efb-4367-aa31-cdfd4067f713
[1/8] checking log skipped (none written)
[2/8] checking root items
[3/8] checking extents
[4/8] checking free space tree
[5/8] checking fs roots
root 5 inode 332 errors 1000, some csum missing
ERROR: errors found in fs roots
(...)
Looking at a tree dump of the fs tree (root 5) for inode 332 we have:
$ btrfs inspect-internal dump-tree -t 5 /dev/sdc
(...)
item 28 key (332 INODE_ITEM 0) itemoff 2006 itemsize 160
generation 17 transid 19 size 610969 nbytes 86016
block group 0 mode 100666 links 1 uid 0 gid 0 rdev 0
sequence 11 flags 0x0(none)
atime 1759851068.391327881 (2025-10-07 16:31:08)
ctime 1759851068.410098267 (2025-10-07 16:31:08)
mtime 1759851068.410098267 (2025-10-07 16:31:08)
otime 1759851068.391327881 (2025-10-07 16:31:08)
item 29 key (332 INODE_REF 340) itemoff 1993 itemsize 13
index 2 namelen 3 name: f1f
item 30 key (332 EXTENT_DATA 589824) itemoff 1940 itemsize 53
generation 19 type 1 (regular)
extent data disk byte 21745664 nr 65536
extent data offset 0 nr 65536 ram 65536
extent compression 0 (none)
(...)
We can see that the file extent item for file offset 589824 has a length of
64K and its number of bytes is 64K. Looking at the inode item we see that
its i_size is 610969 bytes which falls within the range of that file extent
item [589824, 655360[.
Looking into the csum tree:
$ btrfs inspect-internal dump-tree /dev/sdc
(...)
item 15 key (EXTENT_CSUM EXTENT_CSUM 21565440) itemoff 991 itemsize 200
range start 21565440 end 21770240 length 204800
item 16 key (EXTENT_CSUM EXTENT_CSUM 1104576512) itemoff 983 itemsize 8
range start 1104576512 end 1104584704 length 8192
(..)
We see that the csum item number 15 covers the first 24K of the file extent
item - it ends at offset 21770240 and the extent's disk_bytenr is 21745664,
so we have:
21770240 - 21745664 = 24K
We see that the next csum item (number 16) is completely outside the range,
so the remaining 40K of the extent doesn't have csum items in the tree.
If we round up the i_size to the sector size, we get:
round_up(610969, 4096) = 614400
If we subtract from that the file offset for the extent item we get:
614400 - 589824 = 24K
So the missing 40K corresponds to the end of the file extent item's range
minus the rounded up i_size:
655360 - 614400 = 40K
Normally we don't expect a file extent item to span over the rounded up
i_size of an inode, since when truncating, doing hole punching and other
operations that trim a file extent item, the number of bytes is adjusted.
There is however a short time window where the kernel can end up,
temporarily,persisting an inode with an i_size that falls in the middle of
the last file extent item and the file extent item was not yet trimmed (its
number of bytes reduced so that it doesn't cross i_size rounded up by the
sector size).
The steps (in the kernel) that lead to such scenario are the following:
1) We have inode I as an empty file, no allocated extents, i_size is 0;
2) A buffered write is done for file range [589824, 655360[ (length of
64K) and the i_size is updated to 655360. Note that we got a single
large folio for the range (64K);
3) A truncate operation starts that reduces the inode's i_size down to
610969 bytes. The truncate sets the inode's new i_size at
btrfs_setsize() by calling truncate_setsize() and before calling
btrfs_truncate();
4) At btrfs_truncate() we trigger writeback for the range starting at
610304 (which is the new i_size rounded down to the sector size) and
ending at (u64)-1;
5) During the writeback, at extent_write_cache_pages(), we get from the
call to filemap_get_folios_tag(), the 64K folio that starts at file
offset 589824 since it contains the start offset of the writeback
range (610304);
6) At writepage_delalloc() we find the whole range of the folio is dirty
and therefore we run delalloc for that 64K range ([589824, 655360[),
reserving a 64K extent, creating an ordered extent, etc;
7) At extent_writepage_io() we submit IO only for subrange [589824, 614400[
because the inode's i_size is 610969 bytes (rounded up by sector size
is 614400). There, in the while loop we intentionally skip IO beyond
i_size to avoid any unnecessay work and just call
btrfs_mark_ordered_io_finished() for the range [614400, 655360[ (which
has a 40K length);
8) Once the IO finishes we finish the ordered extent by ending up at
btrfs_finish_one_ordered(), join transaction N, insert a file extent
item in the inode's subvolume tree for file offset 589824 with a number
of bytes of 64K, and update the inode's delayed inode item or directly
the inode item with a call to btrfs_update_inode_fallback(), which
results in storing the new i_size of 610969 bytes;
9) Transaction N is committed either by the transaction kthread or some
other task committed it (in response to a sync or fsync for example).
At this point we have inode I persisted with an i_size of 610969 bytes
and file extent item that starts at file offset 589824 and has a number
of bytes of 64K, ending at an offset of 655360 which is beyond the
i_size rounded up to the sector size (614400).
--> So after a crash or power failure here, the btrfs check program
reports that error about missing checksum items for this inode, as
it tries to lookup for checksums covering the whole range of the
extent;
10) Only after transaction N is committed that at btrfs_truncate() the
call to btrfs_start_transaction() starts a new transaction, N + 1,
instead of joining transaction N. And it's with transaction N + 1 that
it calls btrfs_truncate_inode_items() which updates the file extent
item at file offset 589824 to reduce its number of bytes from 64K down
to 24K, so that the file extent item's range ends at the i_size
rounded up to the sector size (614400 bytes).
Fix this by truncating the ordered extent at extent_writepage_io() when we
skip writeback because the current offset in the folio is beyond i_size.
This ensures we don't ever persist a file extent item with a number of
bytes beyond the rounded up (by sector size) value of the i_size.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddf ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 619611e87f ]
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
Which uses the folio size to calculate the number of blocks for each
folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that
helper
There are some exceptions:
* Metadata nodesize < page size support
In the future, even if we support large folios, we will only
allocate a folio that matches our nodesize.
Thus we won't have a folio covering multiple metadata unless
nodesize < page size.
* Existing subpage bitmap dump
We use a single unsigned long to store the bitmap.
That means until we change the bitmap dumping code, our upper limit
for folio size will only be 256K (4K block size, 64 bit unsigned
long).
* btrfs_is_subpage() check
This will be migrated into a future patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddf ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 975a6a8855 ]
All the error handling bugs I hit so far are all -ENOSPC from either:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at
all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddf ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 61d730731b ]
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no
dirty/lock bit already set.
However with my recent enhanced delalloc range error handling, I was
hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out
that some error handling path is not properly updating the folio flags.
So add some extra dumping for the ASSERTs to dump the involved bitmap
to help debug.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddf ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit a7858d5c36 ]
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
This can lead to crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clear the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:
submit_uncompressed_range()
|- run_delalloc_cow()
| |- cow_file_range()
| |- btrfs_reserve_extent()
| Failed with -ENOSPC or whatever error
|
|- btrfs_clean_up_ordered_extents()
| |- btrfs_mark_ordered_io_finished()
| Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
Which cleans the folio range.
The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.
Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().
So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.
For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: e9e3b22ddf ("btrfs: fix beyond-EOF write handling")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 30bcf4e824 ]
[BUG]
Since the introduction of btrfs bs < ps support, v1 cache was never on
the plan due to its hard coded PAGE_SIZE usage, and the future plan to
properly deprecate it.
However for bs < ps cases, even if 'nospace_cache,clear_cache' mount
option is specified, it's never respected and free space tree is always
enabled:
mkfs.btrfs -f -O ^bgt,fst $dev
mount $dev $mnt -o clear_cache,nospace_cache
umount $mnt
btrfs ins dump-super $dev
...
compat_ro_flags 0x3
( FREE_SPACE_TREE |
FREE_SPACE_TREE_VALID )
...
This means a different behavior compared to bs >= ps cases.
[CAUSE]
The forcing usage of v2 space cache is done inside
btrfs_set_free_space_cache_settings(), however it never checks if we're
even using space cache but always enabling v2 cache.
[FIX]
Instead unconditionally enable v2 cache, only forcing v2 cache if the
old v1 cache is required.
Now v2 space cache can be properly disabled on bs < ps cases:
mkfs.btrfs -f -O ^bgt,fst $dev
mount $dev $mnt -o clear_cache,nospace_cache
umount $mnt
btrfs ins dump-super $dev
...
compat_ro_flags 0x0
...
Fixes: 9f73f1aef9 ("btrfs: force v2 space cache usage for subpage mount")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 68d4b3fa18 ]
[BUG]
There is a bug that if a subvolume has multi-level parent qgroups, and
is able to do a quick inherit, only the direct parent qgroup got
updated:
mkfs.btrfs -f -O quota $dev
mount $dev $mnt
btrfs subv create $mnt/subv1
btrfs qgroup create 1/100 $mnt
btrfs qgroup create 2/100 $mnt
btrfs qgroup assign 1/100 2/100 $mnt
btrfs qgroup assign 0/256 1/100 $mnt
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
1/100 16.00KiB 16.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
btrfs subv snap -i 1/100 $mnt/subv1 $mnt/snap1
btrfs qgroup show -p --sync $mnt
Qgroupid Referenced Exclusive Parent Path
-------- ---------- --------- ------ ----
0/5 16.00KiB 16.00KiB - <toplevel>
0/256 16.00KiB 16.00KiB 1/100 subv1
0/257 16.00KiB 16.00KiB 1/100 snap1
1/100 32.00KiB 32.00KiB 2/100 2/100<1 member qgroup>
2/100 16.00KiB 16.00KiB - <0 member qgroups>
# Note that 2/100 is not updated, and qgroup numbers are inconsistent
umount $mnt
[CAUSE]
If the snapshot source subvolume belongs to a parent qgroup, and the new
snapshot target is also added to the new same parent qgroup, we allow a
quick update without marking qgroup inconsistent.
But that quick update only update the parent qgroup, without checking if
there is any more parent qgroups.
[FIX]
Iterate through all parent qgroups during the quick inherit.
Reported-by: Boris Burkov <boris@bur.io>
Fixes: b20fe56cd2 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7ee19a59a7 ]
qgroup_snapshot_quick_inherit() detects conditions where the snapshot
destination would land in the same parent qgroup as the snapshot source
subvolume. In this case we can avoid costly qgroup calculations and just
add the nodesize of the new snapshot to the parent.
However, in the case of squotas this is actually a double count, and
also an undercount for deeper qgroup nestings.
The following annotated script shows the issue:
btrfs quota enable --simple "$mnt"
# Create 2-level qgroup hierarchy
btrfs qgroup create 2/100 "$mnt" # Q2 (level 2)
btrfs qgroup create 1/100 "$mnt" # Q1 (level 1)
btrfs qgroup assign 1/100 2/100 "$mnt"
# Create base subvolume
btrfs subvolume create "$mnt/base" >/dev/null
base_id=$(btrfs subvolume show "$mnt/base" | grep 'Subvolume ID:' | awk '{print $3}')
# Create intermediate snapshot and add to Q1
btrfs subvolume snapshot "$mnt/base" "$mnt/intermediate" >/dev/null
inter_id=$(btrfs subvolume show "$mnt/intermediate" | grep 'Subvolume ID:' | awk '{print $3}')
btrfs qgroup assign "0/$inter_id" 1/100 "$mnt"
# Create working snapshot with --inherit (auto-adds to Q1)
# src=intermediate (in only Q1)
# dst=snap (inheriting only into Q1)
# This double counts the 16k nodesize of the snapshot in Q1, and
# undercounts it in Q2.
btrfs subvolume snapshot -i 1/100 "$mnt/intermediate" "$mnt/snap" >/dev/null
snap_id=$(btrfs subvolume show "$mnt/snap" | grep 'Subvolume ID:' | awk '{print $3}')
# Fully complete snapshot creation
sync
# Delete working snapshot
# Q1 and Q2 will lose the full snap usage
btrfs subvolume delete "$mnt/snap" >/dev/null
# Delete intermediate and remove from Q1
# Q1 and Q2 will lose the full intermediate usage
btrfs qgroup remove "0/$inter_id" 1/100 "$mnt"
btrfs subvolume delete "$mnt/intermediate" >/dev/null
# Q1 should be at 0, but still has 16k. Q2 is "correct" at 0 (for now...)
# Trigger cleaner, wait for deletions
mount -o remount,sync=1 "$mnt"
btrfs subvolume sync "$mnt" "$snap_id"
btrfs subvolume sync "$mnt" "$inter_id"
# Remove Q1 from Q2
# Frees 16k more from Q2, underflowing it to 16EiB
btrfs qgroup remove 1/100 2/100 "$mnt"
# And show the bad state:
btrfs qgroup show -pc "$mnt"
Qgroupid Referenced Exclusive Parent Child Path
-------- ---------- --------- ------ ----- ----
0/5 16.00KiB 16.00KiB - - <toplevel>
0/256 16.00KiB 16.00KiB - - base
1/100 16.00KiB 16.00KiB - - <0 member qgroups>
2/100 16.00EiB 16.00EiB - - <0 member qgroups>
Fix this by simply not doing this quick inheritance with squotas.
I suspect that it is also wrong in normal qgroups to not recurse up the
qgroup tree in the quick inherit case, though other consistency checks
will likely fix it anyway.
Fixes: b20fe56cd2 ("btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7ba0b6461b upstream.
After rename exchanging (either with the rename exchange operation or
regular renames in multiple non-atomic steps) two inodes and at least
one of them is a directory, we can end up with a log tree that contains
only of the inodes and after a power failure that can result in an attempt
to delete the other inode when it should not because it was not deleted
before the power failure. In some case that delete attempt fails when
the target inode is a directory that contains a subvolume inside it, since
the log replay code is not prepared to deal with directory entries that
point to root items (only inode items).
1) We have directories "dir1" (inode A) and "dir2" (inode B) under the
same parent directory;
2) We have a file (inode C) under directory "dir1" (inode A);
3) We have a subvolume inside directory "dir2" (inode B);
4) All these inodes were persisted in a past transaction and we are
currently at transaction N;
5) We rename the file (inode C), so at btrfs_log_new_name() we update
inode C's last_unlink_trans to N;
6) We get a rename exchange for "dir1" (inode A) and "dir2" (inode B),
so after the exchange "dir1" is inode B and "dir2" is inode A.
During the rename exchange we call btrfs_log_new_name() for inodes
A and B, but because they are directories, we don't update their
last_unlink_trans to N;
7) An fsync against the file (inode C) is done, and because its inode
has a last_unlink_trans with a value of N we log its parent directory
(inode A) (through btrfs_log_all_parents(), called from
btrfs_log_inode_parent()).
8) So we end up with inode B not logged, which now has the old name
of inode A. At copy_inode_items_to_log(), when logging inode A, we
did not check if we had any conflicting inode to log because inode
A has a generation lower than the current transaction (created in
a past transaction);
9) After a power failure, when replaying the log tree, since we find that
inode A has a new name that conflicts with the name of inode B in the
fs tree, we attempt to delete inode B... this is wrong since that
directory was never deleted before the power failure, and because there
is a subvolume inside that directory, attempting to delete it will fail
since replay_dir_deletes() and btrfs_unlink_inode() are not prepared
to deal with dir items that point to roots instead of inodes.
When that happens the mount fails and we get a stack trace like the
following:
[87.2314] BTRFS info (device dm-0): start tree-log replay
[87.2318] BTRFS critical (device dm-0): failed to delete reference to subvol, root 5 inode 256 parent 259
[87.2332] ------------[ cut here ]------------
[87.2338] BTRFS: Transaction aborted (error -2)
[87.2346] WARNING: CPU: 1 PID: 638968 at fs/btrfs/inode.c:4345 __btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2368] Modules linked in: btrfs loop dm_thin_pool (...)
[87.2470] CPU: 1 UID: 0 PID: 638968 Comm: mount Tainted: G W 6.18.0-rc7-btrfs-next-218+ #2 PREEMPT(full)
[87.2489] Tainted: [W]=WARN
[87.2494] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[87.2514] RIP: 0010:__btrfs_unlink_inode+0x416/0x440 [btrfs]
[87.2538] Code: c0 89 04 24 (...)
[87.2568] RSP: 0018:ffffc0e741f4b9b8 EFLAGS: 00010286
[87.2574] RAX: 0000000000000000 RBX: ffff9d3ec8a6cf60 RCX: 0000000000000000
[87.2582] RDX: 0000000000000002 RSI: ffffffff84ab45a1 RDI: 00000000ffffffff
[87.2591] RBP: ffff9d3ec8a6ef20 R08: 0000000000000000 R09: ffffc0e741f4b840
[87.2599] R10: ffff9d45dc1fffa8 R11: 0000000000000003 R12: ffff9d3ee26d77e0
[87.2608] R13: ffffc0e741f4ba98 R14: ffff9d4458040800 R15: ffff9d44b6b7ca10
[87.2618] FS: 00007f7b9603a840(0000) GS:ffff9d4658982000(0000) knlGS:0000000000000000
[87.2629] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[87.2637] CR2: 00007ffc9ec33b98 CR3: 000000011273e003 CR4: 0000000000370ef0
[87.2648] Call Trace:
[87.2651] <TASK>
[87.2654] btrfs_unlink_inode+0x15/0x40 [btrfs]
[87.2661] unlink_inode_for_log_replay+0x27/0xf0 [btrfs]
[87.2669] check_item_in_log+0x1ea/0x2c0 [btrfs]
[87.2676] replay_dir_deletes+0x16b/0x380 [btrfs]
[87.2684] fixup_inode_link_count+0x34b/0x370 [btrfs]
[87.2696] fixup_inode_link_counts+0x41/0x160 [btrfs]
[87.2703] btrfs_recover_log_trees+0x1ff/0x7c0 [btrfs]
[87.2711] ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
[87.2719] open_ctree+0x10bb/0x15f0 [btrfs]
[87.2726] btrfs_get_tree.cold+0xb/0x16c [btrfs]
[87.2734] ? fscontext_read+0x15c/0x180
[87.2740] ? rw_verify_area+0x50/0x180
[87.2746] vfs_get_tree+0x25/0xd0
[87.2750] vfs_cmd_create+0x59/0xe0
[87.2755] __do_sys_fsconfig+0x4f6/0x6b0
[87.2760] do_syscall_64+0x50/0x1220
[87.2764] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[87.2770] RIP: 0033:0x7f7b9625f4aa
[87.2775] Code: 73 01 c3 48 (...)
[87.2803] RSP: 002b:00007ffc9ec35b08 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
[87.2817] RAX: ffffffffffffffda RBX: 0000558bfa91ac20 RCX: 00007f7b9625f4aa
[87.2829] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
[87.2842] RBP: 0000558bfa91b120 R08: 0000000000000000 R09: 0000000000000000
[87.2854] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[87.2864] R13: 00007f7b963f1580 R14: 00007f7b963f326c R15: 00007f7b963d8a23
[87.2877] </TASK>
[87.2882] ---[ end trace 0000000000000000 ]---
[87.2891] BTRFS: error (device dm-0 state A) in __btrfs_unlink_inode:4345: errno=-2 No such entry
[87.2904] BTRFS: error (device dm-0 state EAO) in do_abort_log_replay:191: errno=-2 No such entry
[87.2915] BTRFS critical (device dm-0 state EAO): log tree (for root 5) leaf currently being processed (slot 7 key (258 12 257)):
[87.2929] BTRFS info (device dm-0 state EAO): leaf 30736384 gen 10 total ptrs 7 free space 15712 owner 18446744073709551610
[87.2929] BTRFS info (device dm-0 state EAO): refs 3 lock_owner 0 current 638968
[87.2929] item 0 key (257 INODE_ITEM 0) itemoff 16123 itemsize 160
[87.2929] inode generation 9 transid 10 size 0 nbytes 0
[87.2929] block group 0 mode 40755 links 1 uid 0 gid 0
[87.2929] rdev 0 sequence 7 flags 0x0
[87.2929] atime 1765464494.678070921
[87.2929] ctime 1765464494.686606513
[87.2929] mtime 1765464494.686606513
[87.2929] otime 1765464494.678070921
[87.2929] item 1 key (257 INODE_REF 256) itemoff 16109 itemsize 14
[87.2929] index 4 name_len 4
[87.2929] item 2 key (257 DIR_LOG_INDEX 2) itemoff 16101 itemsize 8
[87.2929] dir log end 2
[87.2929] item 3 key (257 DIR_LOG_INDEX 3) itemoff 16093 itemsize 8
[87.2929] dir log end 18446744073709551615
[87.2930] item 4 key (257 DIR_INDEX 3) itemoff 16060 itemsize 33
[87.2930] location key (258 1 0) type 1
[87.2930] transid 10 data_len 0 name_len 3
[87.2930] item 5 key (258 INODE_ITEM 0) itemoff 15900 itemsize 160
[87.2930] inode generation 9 transid 10 size 0 nbytes 0
[87.2930] block group 0 mode 100644 links 1 uid 0 gid 0
[87.2930] rdev 0 sequence 2 flags 0x0
[87.2930] atime 1765464494.678456467
[87.2930] ctime 1765464494.686606513
[87.2930] mtime 1765464494.678456467
[87.2930] otime 1765464494.678456467
[87.2930] item 6 key (258 INODE_REF 257) itemoff 15887 itemsize 13
[87.2930] index 3 name_len 3
[87.2930] BTRFS critical (device dm-0 state EAO): log replay failed in unlink_inode_for_log_replay:1045 for root 5, stage 3, with error -2: failed to unlink inode 256 parent dir 259 name subvol root 5
[87.2963] BTRFS: error (device dm-0 state EAO) in btrfs_recover_log_trees:7743: errno=-2 No such entry
[87.2981] BTRFS: error (device dm-0 state EAO) in btrfs_replay_log:2083: errno=-2 No such entry (Failed to recover log tr
So fix this by changing copy_inode_items_to_log() to always detect if
there are conflicting inodes for the ref/extref of the inode being logged
even if the inode was created in a past transaction.
A test case for fstests will follow soon.
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 54df8b80cc ]
[BUG]
When a scrub failed immediately without any byte scrubbed, the returned
btrfs_scrub_progress::last_physical will always be 0, even if there is a
non-zero @start passed into btrfs_scrub_dev() for resume cases.
This will reset the progress and make later scrub resume start from the
beginning.
[CAUSE]
The function btrfs_scrub_dev() accepts a @progress parameter to copy its
updated progress to the caller, there are cases where we either don't
touch progress::last_physical at all or copy 0 into last_physical:
- last_physical not updated at all
If some error happened before scrubbing any super block or chunk, we
will not copy the progress, leaving the @last_physical untouched.
E.g. failed to allocate @sctx, scrubbing a missing device or even
there is already a running scrub and so on.
All those cases won't touch @progress at all, resulting the
last_physical untouched and will be left as 0 for most cases.
- Error out before scrubbing any bytes
In those case we allocated @sctx, and sctx->stat.last_physical is all
zero (initialized by kvzalloc()).
Unfortunately some critical errors happened during
scrub_enumerate_chunks() or scrub_supers() before any stripe is really
scrubbed.
In that case although we will copy sctx->stat back to @progress, since
no byte is really scrubbed, last_physical will be overwritten to 0.
[FIX]
Make sure the parameter @progress always has its @last_physical member
updated to @start parameter inside btrfs_scrub_dev().
At the very beginning of the function, set @progress->last_physical to
@start, so that even if we error out without doing progress copying,
last_physical is still at @start.
Then after we got @sctx allocated, set sctx->stat.last_physical to
@start, this will make sure even if we didn't get any byte scrubbed, at
the progress copying stage the @last_physical is not left as zero.
This should resolve the resume progress reset problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>