[ Upstream commit 7633a7387e ]
During SPO tests, when mounting F2FS, an -EINVAL error was returned from
f2fs_recover_inode_page. The issue occurred under the following scenario
Thread A Thread B
f2fs_ioc_commit_atomic_write
- f2fs_do_sync_file // atomic = true
- f2fs_fsync_node_pages
: last_folio = inode folio
: schedule before folio_lock(last_folio) f2fs_write_checkpoint
- block_operations// writeback last_folio
- schedule before f2fs_flush_nat_entries
: set_fsync_mark(last_folio, 1)
: set_dentry_mark(last_folio, 1)
: folio_mark_dirty(last_folio)
- __write_node_folio(last_folio)
: f2fs_down_read(&sbi->node_write)//block
- f2fs_flush_nat_entries
: {struct nat_entry}->flag |= BIT(IS_CHECKPOINTED)
- unblock_operations
: f2fs_up_write(&sbi->node_write)
f2fs_write_checkpoint//return
: f2fs_do_write_node_page()
f2fs_ioc_commit_atomic_write//return
SPO
Thread A calls f2fs_need_dentry_mark(sbi, ino), and the last_folio has
already been written once. However, the {struct nat_entry}->flag did not
have the IS_CHECKPOINTED set, causing set_dentry_mark(last_folio, 1) and
write last_folio again after Thread B finishes f2fs_write_checkpoint.
After SPO and reboot, it was detected that {struct node_info}->blk_addr
was not NULL_ADDR because Thread B successfully write the checkpoint.
This issue only occurs in atomic write scenarios. For regular file
fsync operations, the folio must be dirty. If
block_operations->f2fs_sync_node_pages successfully submit the folio
write, this path will not be executed. Otherwise, the
f2fs_write_checkpoint will need to wait for the folio write submission
to complete, as sbi->nr_pages[F2FS_DIRTY_NODES] > 0. Therefore, the
situation where f2fs_need_dentry_mark checks that the {struct
nat_entry}->flag /wo the IS_CHECKPOINTED flag, but the folio write has
already been submitted, will not occur.
Therefore, for atomic file fsync, sbi->node_write should be acquired
through __write_node_folio to ensure that the IS_CHECKPOINTED flag
correctly indicates that the checkpoint write has been completed.
Fixes: 608514deba ("f2fs: set fsync mark only for the last dnode")
Cc: stable@kernel.org
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ folio => page ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ce2739e482 ]
As syzbot reported an use-after-free issue in f2fs_write_end_io().
It is caused by below race condition:
loop device umount
- worker_thread
- loop_process_work
- do_req_filebacked
- lo_rw_aio
- lo_rw_aio_complete
- blk_mq_end_request
- blk_update_request
- f2fs_write_end_io
- dec_page_count
- folio_end_writeback
- kill_f2fs_super
- kill_block_super
- f2fs_put_super
: free(sbi)
: get_pages(, F2FS_WB_CP_DATA)
accessed sbi which is freed
In kill_f2fs_super(), we will drop all page caches of f2fs inodes before
call free(sbi), it guarantee that all folios should end its writeback, so
it should be safe to access sbi before last folio_end_writeback().
Let's relocate ckpt thread wakeup flow before folio_end_writeback() to
resolve this issue.
Cc: stable@kernel.org
Fixes: e234088758 ("f2fs: avoid wait if IO end up when do_checkpoint for better performance")
Reported-by: syzbot+b4444e3c972a7a124187@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4444e3c972a7a124187
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ folio => page ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 5c145c0318 upstream.
Xiaolong Guo reported a f2fs bug in bugzilla [1]
[1] https://bugzilla.kernel.org/show_bug.cgi?id=220951
Quoted:
"When using stress-ng's swap stress test on F2FS filesystem with kernel 6.6+,
the system experiences data corruption leading to either:
1 dm-verity corruption errors and device reboot
2 F2FS node corruption errors and boot hangs
The issue occurs specifically when:
1 Using F2FS filesystem (ext4 is unaffected)
2 Swapfile size is less than F2FS section size (2MB)
3 Swapfile has fragmented physical layout (multiple non-contiguous extents)
4 Kernel version is 6.6+ (6.1 is unaffected)
The root cause is in check_swap_activate() function in fs/f2fs/data.c. When the
first extent of a small swapfile (< 2MB) is not aligned to section boundaries,
the function incorrectly treats it as the last extent, failing to map
subsequent extents. This results in incorrect swap_extent creation where only
the first extent is mapped, causing subsequent swap writes to overwrite wrong
physical locations (other files' data).
Steps to Reproduce
1 Setup a device with F2FS-formatted userdata partition
2 Compile stress-ng from https://github.com/ColinIanKing/stress-ng
3 Run swap stress test: (Android devices)
adb shell "cd /data/stressng; ./stress-ng-64 --metrics-brief --timeout 60
--swap 0"
Log:
1 Ftrace shows in kernel 6.6, only first extent is mapped during second
f2fs_map_blocks call in check_swap_activate():
stress-ng-swap-8990: f2fs_map_blocks: ino=11002, file offset=0, start
blkaddr=0x43143, len=0x1
(Only 4KB mapped, not the full swapfile)
2 in kernel 6.1, both extents are correctly mapped:
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=0, start
blkaddr=0x13cd4, len=0x1
stress-ng-swap-5966: f2fs_map_blocks: ino=28011, file offset=1, start
blkaddr=0x60c84b, len=0xff
The problematic code is in check_swap_activate():
if ((pblock - SM_I(sbi)->main_blkaddr) % blks_per_sec ||
nr_pblocks % blks_per_sec ||
!f2fs_valid_pinned_area(sbi, pblock)) {
bool last_extent = false;
not_aligned++;
nr_pblocks = roundup(nr_pblocks, blks_per_sec);
if (cur_lblock + nr_pblocks > sis->max)
nr_pblocks -= blks_per_sec;
/* this extent is last one */
if (!nr_pblocks) {
nr_pblocks = last_lblock - cur_lblock;
last_extent = true;
}
ret = f2fs_migrate_blocks(inode, cur_lblock, nr_pblocks);
if (ret) {
if (ret == -ENOENT)
ret = -EINVAL;
goto out;
}
if (!last_extent)
goto retry;
}
When the first extent is unaligned and roundup(nr_pblocks, blks_per_sec)
exceeds sis->max, we subtract blks_per_sec resulting in nr_pblocks = 0. The
code then incorrectly assumes this is the last extent, sets nr_pblocks =
last_lblock - cur_lblock (entire swapfile), and performs migration. After
migration, it doesn't retry mapping, so subsequent extents are never processed.
"
In order to fix this issue, we need to lookup block mapping info after
we migrate all blocks in the tail of swapfile.
Cc: stable@kernel.org
Fixes: 9703d69d9d ("f2fs: support file pinning for zoned devices")
Cc: Daeho Jeong <daehojeong@google.com>
Reported-and-tested-by: Xiaolong Guo <guoxiaolong2008@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220951
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 98ea0039db upstream.
Some f2fs sysfs attributes suffer from out-of-bounds memory access and
incorrect handling of integer values whose size is not 4 bytes.
For example:
vm:~# echo 65537 > /sys/fs/f2fs/vde/carve_out
vm:~# cat /sys/fs/f2fs/vde/carve_out
65537
vm:~# echo 4294967297 > /sys/fs/f2fs/vde/atgc_age_threshold
vm:~# cat /sys/fs/f2fs/vde/atgc_age_threshold
1
carve_out maps to {struct f2fs_sb_info}->carve_out, which is a 8-bit
integer. However, the sysfs interface allows setting it to a value
larger than 255, resulting in an out-of-range update.
atgc_age_threshold maps to {struct atgc_management}->age_threshold,
which is a 64-bit integer, but its sysfs interface cannot correctly set
values larger than UINT_MAX.
The root causes are:
1. __sbi_store() treats all default values as unsigned int, which
prevents updating integers larger than 4 bytes and causes out-of-bounds
writes for integers smaller than 4 bytes.
2. f2fs_sbi_show() also assumes all default values are unsigned int,
leading to out-of-bounds reads and incorrect access to integers larger
than 4 bytes.
This patch introduces {struct f2fs_attr}->size to record the actual size
of the integer associated with each sysfs attribute. With this
information, sysfs read and write operations can correctly access and
update values according to their real data size, avoiding memory
corruption and truncation.
Fixes: b59d0bae6c ("f2fs: add sysfs support for controlling the gc_thread")
Cc: stable@kernel.org
Signed-off-by: Jinbao Liu <liujinbao1@xiaomi.com>
Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4bc3477796 ]
During f2fs_enable_checkpoint() in remount(), if we flush a large
amount of dirty pages into slow device, it may take long time which
will block write IO, let's add a timeout machanism during dirty
pages flush to avoid long time block in f2fs_enable_checkpoint().
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: be112e7449 ("f2fs: fix to propagate error from f2fs_enable_checkpoint()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f88c7904b5 ]
SBI_POR_DOING can be cleared after recovery is completed, so that
changes made before recovery can be persistent, and subsequent
errors can be recorded into cp/sb.
Signed-off-by: Song Feng <songfeng@oppo.com>
Signed-off-by: Yongpeng Yang <yangyongpeng1@oppo.com>
Signed-off-by: Sheng Yong <shengyong@oppo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: be112e7449 ("f2fs: fix to propagate error from f2fs_enable_checkpoint()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 078cad8212 ]
Let's drop the inode from the donation list when there is no other
open file.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 10b591e7fb ("f2fs: fix to avoid updating compression context during writeback")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 1f27ef42bb ]
As Hong Yun reported in mailing list:
loop7: detected capacity change from 0 to 131072
------------[ cut here ]------------
kmem_cache of name 'f2fs_xattr_entry-7:7' already exists
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 kmem_cache_sanity_check mm/slab_common.c:109 [inline]
WARNING: CPU: 0 PID: 24426 at mm/slab_common.c:110 __kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
CPU: 0 UID: 0 PID: 24426 Comm: syz.7.1370 Not tainted 6.17.0-rc4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:kmem_cache_sanity_check mm/slab_common.c:109 [inline]
RIP: 0010:__kmem_cache_create_args+0xa6/0x320 mm/slab_common.c:307
Call Trace:
__kmem_cache_create include/linux/slab.h:353 [inline]
f2fs_kmem_cache_create fs/f2fs/f2fs.h:2943 [inline]
f2fs_init_xattr_caches+0xa5/0xe0 fs/f2fs/xattr.c:843
f2fs_fill_super+0x1645/0x2620 fs/f2fs/super.c:4918
get_tree_bdev_flags+0x1fb/0x260 fs/super.c:1692
vfs_get_tree+0x43/0x140 fs/super.c:1815
do_new_mount+0x201/0x550 fs/namespace.c:3808
do_mount fs/namespace.c:4136 [inline]
__do_sys_mount fs/namespace.c:4347 [inline]
__se_sys_mount+0x298/0x2f0 fs/namespace.c:4324
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x8e/0x3a0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug can be reproduced w/ below scripts:
- mount /dev/vdb /mnt1
- mount /dev/vdc /mnt2
- umount /mnt1
- mounnt /dev/vdb /mnt1
The reason is if we created two slab caches, named f2fs_xattr_entry-7:3
and f2fs_xattr_entry-7:7, and they have the same slab size. Actually,
slab system will only create one slab cache core structure which has
slab name of "f2fs_xattr_entry-7:3", and two slab caches share the same
structure and cache address.
So, if we destroy f2fs_xattr_entry-7:3 cache w/ cache address, it will
decrease reference count of slab cache, rather than release slab cache
entirely, since there is one more user has referenced the cache.
Then, if we try to create slab cache w/ name "f2fs_xattr_entry-7:3" again,
slab system will find that there is existed cache which has the same name
and trigger the warning.
Let's changes to use global inline_xattr_slab instead of per-sb slab cache
for fixing.
Fixes: a999150f4f ("f2fs: use kmem_cache pool during inline xattr lookups")
Cc: stable@kernel.org
Reported-by: Hong Yun <yhong@link.cuhk.edu.hk>
Tested-by: Hong Yun <yhong@link.cuhk.edu.hk>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ folio => page ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 68d05693f8 ]
mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync # avoid CP_UMOUNT_FLAG in last f2fs_checkpoint.ckpt_flags
touch /mnt/f2fs/bar
f2fs_io fsync /mnt/f2fs/bar
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
blockdev --setro /dev/vdd
mount /dev/vdd /mnt/f2fs
mount: /mnt/f2fs: WARNING: source write-protected, mounted read-only.
For the case if we create and fsync a new inode before sudden power-cut,
without norecovery or disable_roll_forward mount option, the following
mount will succeed w/o recovering last fsynced inode.
The problem here is that we only check inode_list list after
find_fsync_dnodes() in f2fs_recover_fsync_data() to find out whether
there is recoverable data in the iamge, but there is a missed case, if
last fsynced inode is not existing in last checkpoint, then, we will
fail to get its inode due to nat of inode node is not existing in last
checkpoint, so the inode won't be linked in inode_list.
Let's detect such case in dyrun mode to fix this issue.
After this change, mount will fail as expected below:
mount: /mnt/f2fs: cannot mount /dev/vdd read-only.
dmesg(1) may have more information after failed mount system call.
demsg:
F2FS-fs (vdd): Need to recover fsync data, but write access unavailable, please try mount w/ disable_roll_forward or norecovery
Cc: stable@kernel.org
Fixes: 6781eabba1 ("f2fs: give -EINVAL for norecovery and rw mount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ folio => page ]
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 01fba45dea upstream.
With below scripts, it will trigger panic in f2fs:
mkfs.f2fs -f /dev/vdd
mount /dev/vdd /mnt/f2fs
touch /mnt/f2fs/foo
sync
echo 111 >> /mnt/f2fs/foo
f2fs_io fsync /mnt/f2fs/foo
f2fs_io shutdown 2 /mnt/f2fs
umount /mnt/f2fs
mount -o ro,norecovery /dev/vdd /mnt/f2fs
or
mount -o ro,disable_roll_forward /dev/vdd /mnt/f2fs
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 0
F2FS-fs (vdd): Mounted with checkpoint version = 7f5c361f
F2FS-fs (vdd): Stopped filesystem due to reason: 0
F2FS-fs (vdd): f2fs_recover_fsync_data: recovery fsync data, check_only: 1
Filesystem f2fs get_tree() didn't set fc->root, returned 1
------------[ cut here ]------------
kernel BUG at fs/super.c:1761!
Oops: invalid opcode: 0000 [#1] SMP PTI
CPU: 3 UID: 0 PID: 722 Comm: mount Not tainted 6.18.0-rc2+ #721 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:vfs_get_tree.cold+0x18/0x1a
Call Trace:
<TASK>
fc_mount+0x13/0xa0
path_mount+0x34e/0xc50
__x64_sys_mount+0x121/0x150
do_syscall_64+0x84/0x800
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fa6cc126cfe
The root cause is we missed to handle error number returned from
f2fs_recover_fsync_data() when mounting image w/ ro,norecovery or
ro,disable_roll_forward mount option, result in returning a positive
error number to vfs_get_tree(), fix it.
Cc: stable@kernel.org
Fixes: 6781eabba1 ("f2fs: give -EINVAL for norecovery and rw mount")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 392711ef18 upstream.
The one_time_gc field in struct victim_sel_policy is conditionally
initialized but unconditionally read, leading to undefined behavior
that triggers UBSAN warnings.
In f2fs_get_victim() at fs/f2fs/gc.c:774, the victim_sel_policy
structure is declared without initialization:
struct victim_sel_policy p;
The field p.one_time_gc is only assigned when the 'one_time' parameter
is true (line 789):
if (one_time) {
p.one_time_gc = one_time;
...
}
However, this field is unconditionally read in subsequent get_gc_cost()
at line 395:
if (p->one_time_gc && (valid_thresh_ratio < 100) && ...)
When one_time is false, p.one_time_gc contains uninitialized stack
memory. Hence p.one_time_gc is an invalid bool value.
UBSAN detects this invalid bool value:
UBSAN: invalid-load in fs/f2fs/gc.c:395:7
load of value 77 is not a valid value for type '_Bool'
CPU: 3 UID: 0 PID: 1297 Comm: f2fs_gc-252:16 Not tainted 6.18.0-rc3
#5 PREEMPT(voluntary)
Hardware name: OpenStack Foundation OpenStack Nova,
BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x70/0x90
dump_stack+0x14/0x20
__ubsan_handle_load_invalid_value+0xb3/0xf0
? dl_server_update+0x2e/0x40
? update_curr+0x147/0x170
f2fs_get_victim.cold+0x66/0x134 [f2fs]
? sched_balance_newidle+0x2ca/0x470
? finish_task_switch.isra.0+0x8d/0x2a0
f2fs_gc+0x2ba/0x8e0 [f2fs]
? _raw_spin_unlock_irqrestore+0x12/0x40
? __timer_delete_sync+0x80/0xe0
? timer_delete_sync+0x14/0x20
? schedule_timeout+0x82/0x100
gc_thread_func+0x38b/0x860 [f2fs]
? gc_thread_func+0x38b/0x860 [f2fs]
? __pfx_autoremove_wake_function+0x10/0x10
kthread+0x10b/0x220
? __pfx_gc_thread_func+0x10/0x10 [f2fs]
? _raw_spin_unlock_irq+0x12/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork+0x11a/0x160
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This issue is reliably reproducible with the following steps on a
100GB SSD /dev/vdb:
mkfs.f2fs -f /dev/vdb
mount /dev/vdb /mnt/f2fs_test
fio --name=gc --directory=/mnt/f2fs_test --rw=randwrite \
--bs=4k --size=8G --numjobs=12 --fsync=4 --runtime=10 \
--time_based
echo 1 > /sys/fs/f2fs/vdb/gc_urgent
The uninitialized value causes incorrect GC victim selection, leading
to unpredictable garbage collection behavior.
Fix by zero-initializing the entire victim_sel_policy structure to
ensure all fields have defined values.
Fixes: e791d00bd0 ("f2fs: add valid block ratio not to do excessive GC for one time GC")
Cc: stable@kernel.org
Signed-off-by: Xiaole He <hexiaole1994@126.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d33f89b34a upstream.
F2FS can mount filesystems with corrupted directory depth values that
get runtime-clamped to MAX_DIR_HASH_DEPTH. When RENAME_WHITEOUT
operations are performed on such directories, f2fs_rename performs
directory modifications (updating target entry and deleting source
entry) before attempting to add the whiteout entry via f2fs_add_link.
If f2fs_add_link fails due to the corrupted directory structure, the
function returns an error to VFS, but the partial directory
modifications have already been committed to disk. VFS assumes the
entire rename operation failed and does not update the dentry cache,
leaving stale mappings.
In the error path, VFS does not call d_move() to update the dentry
cache. This results in new_dentry still pointing to the old inode
(new_inode) which has already had its i_nlink decremented to zero.
The stale cache causes subsequent operations to incorrectly reference
the freed inode.
This causes subsequent operations to use cached dentry information that
no longer matches the on-disk state. When a second rename targets the
same entry, VFS attempts to decrement i_nlink on the stale inode, which
may already have i_nlink=0, triggering a WARNING in drop_nlink().
Example sequence:
1. First rename (RENAME_WHITEOUT): file2 → file1
- f2fs updates file1 entry on disk (points to inode 8)
- f2fs deletes file2 entry on disk
- f2fs_add_link(whiteout) fails (corrupted directory)
- Returns error to VFS
- VFS does not call d_move() due to error
- VFS cache still has: file1 → inode 7 (stale!)
- inode 7 has i_nlink=0 (already decremented)
2. Second rename: file3 → file1
- VFS uses stale cache: file1 → inode 7
- Tries to drop_nlink on inode 7 (i_nlink already 0)
- WARNING in drop_nlink()
Fix this by explicitly invalidating old_dentry and new_dentry when
f2fs_add_link fails during whiteout creation. This forces VFS to
refresh from disk on subsequent operations, ensuring cache consistency
even when the rename partially succeeds.
Reproducer:
1. Mount F2FS image with corrupted i_current_depth
2. renameat2(file2, file1, RENAME_WHITEOUT)
3. renameat2(file3, file1, 0)
4. System triggers WARNING in drop_nlink()
Fixes: 7e01e7ad74 ("f2fs: support RENAME_WHITEOUT")
Reported-by: syzbot+632cf32276a9a564188d@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=632cf32276a9a564188d
Suggested-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/all/20251022233349.102728-1-kartikey406@gmail.com/ [v1]
Cc: stable@vger.kernel.org
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 297baa4aa2 upstream.
Xfstests generic/335, generic/336 sometimes crash with the following message:
F2FS-fs (dm-0): detect filesystem reference count leak during umount, type: 9, count: 1
------------[ cut here ]------------
kernel BUG at fs/f2fs/super.c:1939!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 609351 Comm: umount Tainted: G W 6.17.0-rc5-xfstests-g9dd1835ecda5 #1 PREEMPT(none)
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:f2fs_put_super+0x3b3/0x3c0
Call Trace:
<TASK>
generic_shutdown_super+0x7e/0x190
kill_block_super+0x1a/0x40
kill_f2fs_super+0x9d/0x190
deactivate_locked_super+0x30/0xb0
cleanup_mnt+0xba/0x150
task_work_run+0x5c/0xa0
exit_to_user_mode_loop+0xb7/0xc0
do_syscall_64+0x1ae/0x1c0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
</TASK>
---[ end trace 0000000000000000 ]---
It appears that sometimes it is possible that f2fs_put_super() is called before
all node page reads are completed.
Adding a call to f2fs_wait_on_all_pages() for F2FS_RD_NODE fixes the problem.
Cc: stable@kernel.org
Fixes: 20872584b8 ("f2fs: fix to drop all dirty meta/node pages during umount()")
Signed-off-by: Jan Prusakowski <jprusakowski@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e462fc48ce ]
The current version missed setting one time GC for normal zoned GC
cycle. So, valid threshold control is not working. Need to fix it to
prevent excessive GC for zoned devices.
Fixes: e791d00bd0 ("f2fs: add valid block ratio not to do excessive GC for one time GC")
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1d4c5dbba1 ]
Add a sysfs knob to set a multiplier for the background GC migration
window when F2FS Garbage Collection is boosted.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 59c1c89e9b ]
This patch introduces /sys/fs/f2fs/<dev>/reserved_pin_section for tuning
@needed parameter of has_not_enough_free_secs(), if we configure it w/
zero, it can avoid f2fs_gc() as much as possible while fallocating on
pinned file.
Signed-off-by: Chao Yu <chao@kernel.org>
Reviewed-by: wangzijie <wangzijie1@honor.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3fea0641b0 ]
This patch adds a new sysfs entry /sys/fs/f2fs/<disk>/encoding_flags,
it is a read-only entry to show the value of sb.s_encoding_flags, the
value is hexadecimal.
============================ ==========
Flag_Name Flag_Value
============================ ==========
SB_ENC_STRICT_MODE_FL 0x00000001
SB_ENC_NO_COMPAT_FALLBACK_FL 0x00000002
============================ ==========
case#1
mkfs.f2fs -f -O casefold -C utf8:strict /dev/vda
mount /dev/vda /mnt/f2fs
cat /sys/fs/f2fs/vda/encoding_flags
1
case#2
mkfs.f2fs -f -O casefold -C utf8 /dev/vda
fsck.f2fs --nolinear-lookup=1 /dev/vda
mount /dev/vda /mnt/f2fs
cat /sys/fs/f2fs/vda/encoding_flags
2
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit d7b549def0 ]
For several zoned storage devices, vendors will provide extra space
which was used for device level GC than specs and F2FS can use this
space for filesystem level GC. To do that, we can reserve the space
using reserved_blocks. However, it is not enough, since this extra
space should not be shown to users. So, with this new sysfs node,
we can hide the space by substracting reserved_blocks from total
bytes.
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f7f8932ca6 ]
If checkpoint is disabled, GC can not reclaim any segments, we need
to detect such condition and bail out from fallocate() of a pinfile,
rather than letting allocator running out of free segment, which may
cause f2fs to be shutdown.
reproducer:
mkfs.f2fs -f /dev/vda 16777216
mount -o checkpoint=disable:10% /dev/vda /mnt/f2fs
for ((i=0;i<4096;i++)) do { dd if=/dev/zero of=/mnt/f2fs/$i bs=1M count=1; } done
sync
for ((i=0;i<4096;i+=2)) do { rm /mnt/f2fs/$i; } done
sync
touch /mnt/f2fs/pinfile
f2fs_io pinfile set /mnt/f2fs/pinfile
f2fs_io fallocate 0 0 4201644032 /mnt/f2fs/pinfile
cat /sys/kernel/debug/f2fs/status
output:
- Free: 0 (0)
Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a907f3a68e ]
1. fadvise(fd1, POSIX_FADV_NOREUSE, {0,3});
2. fadvise(fd2, POSIX_FADV_NOREUSE, {1,2});
3. fadvise(fd3, POSIX_FADV_NOREUSE, {3,1});
4. echo 1024 > /sys/fs/f2fs/tuning/reclaim_caches_kb
This gives a way to reclaim file-backed pages by iterating all f2fs mounts until
reclaiming 1MB page cache ranges, registered by #1, #2, and #3.
5. cat /sys/fs/f2fs/tuning/reclaim_caches_kb
-> gives total number of registered file ranges.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ef0c333cad ]
This patch records POSIX_FADV_NOREUSE ranges for users to reclaim the caches
instantly off from LRU.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: e462fc48ce ("f2fs: maintain one time GC mode is enabled during whole zoned GC cycle")
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0fe1c6bec5 ]
Should cast type of folio->index from pgoff_t to loff_t to avoid overflow
while left shift operation.
Fixes: 3265d3db1f ("f2fs: support partial truncation on compressed inode")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
[ Modification: Using rpages[i]->index instead of folio->index due to
it was changed since commit:1cda5bc0b2fe ("f2fs: Use a folio in
f2fs_truncate_partial_cluster()") on 6.14 ]
Signed-off-by: Rajani Kantha <681739313@139.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 23361bd549 ]
When we get wrong extent info data, and look up extent_node in rb tree,
it will cause infinite loop (CONFIG_F2FS_CHECK_FS=n). Avoiding this by
return NULL and print some kernel messages in that case.
Signed-off-by: wangzijie <wangzijie1@honor.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 8fc6056dcf ]
As reported, on-disk footer.ino and footer.nid is the same and
out-of-range, let's add sanity check on f2fs_alloc_nid() to detect
any potential corruption in free_nid_list.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 48ea8b2004 ]
syzbot reports a f2fs bug as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:2746!
CPU: 0 UID: 0 PID: 5323 Comm: syz.0.0 Not tainted 6.13.0-rc2-syzkaller-00018-g7cb1b4663150 #0
RIP: 0010:get_new_segment fs/f2fs/segment.c:2746 [inline]
RIP: 0010:new_curseg+0x1f52/0x1f70 fs/f2fs/segment.c:2876
Call Trace:
<TASK>
__allocate_new_segment+0x1ce/0x940 fs/f2fs/segment.c:3210
f2fs_allocate_new_section fs/f2fs/segment.c:3224 [inline]
f2fs_allocate_pinning_section+0xfa/0x4e0 fs/f2fs/segment.c:3238
f2fs_expand_inode_data+0x696/0xca0 fs/f2fs/file.c:1830
f2fs_fallocate+0x537/0xa10 fs/f2fs/file.c:1940
vfs_fallocate+0x569/0x6e0 fs/open.c:327
do_vfs_ioctl+0x258c/0x2e40 fs/ioctl.c:885
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x80/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Concurrent pinfile allocation may run out of free section, result in
panic in get_new_segment(), let's expand pin_sem lock coverage to
include f2fs_gc(), so that we can make sure to reclaim enough free
space for following allocation.
In addition, do below changes to enhance error path handling:
- call f2fs_bug_on() only in non-pinfile allocation path in
get_new_segment().
- call reset_curseg_fields() to reset all fields of curseg in
new_curseg()
Fixes: f5a53edcf0 ("f2fs: support aligned pinned file")
Reported-by: syzbot+15669ec8c35ddf6c3d43@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/675cd64e.050a0220.37aaf.00bb.GAE@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Rajani Kantha <681739313@139.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9d5c4f5c7a upstream.
Assuming the disk layout as below,
disk0: 0 --- 0x00035abfff
disk1: 0x00035ac000 --- 0x00037abfff
disk2: 0x00037ac000 --- 0x00037ebfff
and we want to read data from offset=13568 having len=128 across the block
devices, we can illustrate the block addresses like below.
0 .. 0x00037ac000 ------------------- 0x00037ebfff, 0x00037ec000 -------
| ^ ^ ^
| fofs 0 13568 13568+128
| ------------------------------------------------------
| LBA 0x37e8aa9 0x37ebfa9 0x37ec029
--- map 0x3caa9 0x3ffa9
In this example, we should give the relative map of the target block device
ranging from 0x3caa9 to 0x3ffa9 where the length should be calculated by
0x37ebfff + 1 - 0x37ebfa9.
In the below equation, however, map->m_pblk was supposed to be the original
address instead of the one from the target block address.
- map->m_len = min(map->m_len, dev->end_blk + 1 - map->m_pblk);
Cc: stable@vger.kernel.org
Fixes: 71f2c82062 ("f2fs: multidevice: support direct IO")
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit c2f7c32b25 ]
f2fs_zero_post_eof_page() may cuase more overhead due to invalidate_lock
and page lookup, change as below to mitigate its overhead:
- check new_size before grabbing invalidate_lock
- lookup and invalidate pages only in range of [old_size, new_size]
Fixes: ba8dac350f ("f2fs: fix to zero post-eof page")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e75ce11790 ]
If reserve_root mount option is not assigned, __allow_reserved_blocks()
will return false, it's not correct, fix it.
Fixes: 7e65be49ed ("f2fs: add reserved blocks for root user")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 77de19b686 upstream.
As Jiaming Zhang reported:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x1c1/0x2a0 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x17e/0x800 mm/kasan/report.c:480
kasan_report+0x147/0x180 mm/kasan/report.c:593
data_blkaddr fs/f2fs/f2fs.h:3053 [inline]
f2fs_data_blkaddr fs/f2fs/f2fs.h:3058 [inline]
f2fs_get_dnode_of_data+0x1a09/0x1c40 fs/f2fs/node.c:855
f2fs_reserve_block+0x53/0x310 fs/f2fs/data.c:1195
prepare_write_begin fs/f2fs/data.c:3395 [inline]
f2fs_write_begin+0xf39/0x2190 fs/f2fs/data.c:3594
generic_perform_write+0x2c7/0x910 mm/filemap.c:4112
f2fs_buffered_write_iter fs/f2fs/file.c:4988 [inline]
f2fs_file_write_iter+0x1ec8/0x2410 fs/f2fs/file.c:5216
new_sync_write fs/read_write.c:593 [inline]
vfs_write+0x546/0xa90 fs/read_write.c:686
ksys_write+0x149/0x250 fs/read_write.c:738
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xf3/0x3d0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The root cause is in the corrupted image, there is a dnode has the same
node id w/ its inode, so during f2fs_get_dnode_of_data(), it tries to
access block address in dnode at offset 934, however it parses the dnode
as inode node, so that get_dnode_addr() returns 360, then it tries to
access page address from 360 + 934 * 4 = 4096 w/ 4 bytes.
To fix this issue, let's add sanity check for node id of all direct nodes
during f2fs_get_dnode_of_data().
Cc: stable@kernel.org
Reported-by: Jiaming Zhang <r772577952@gmail.com>
Closes: https://groups.google.com/g/syzkaller/c/-ZnaaOOfO3M
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e23ab8028d ]
Let's return errors caught by the generic checks. This fixes generic/494 where
it expects to see EBUSY by setattr_prepare instead of EINVAL by f2fs for active
swapfile.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1005a3ca28 ]
w/ "mode=lfs" mount option, generic/299 will cause system panic as below:
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:2835!
Call Trace:
<TASK>
f2fs_allocate_data_block+0x6f4/0xc50
f2fs_map_blocks+0x970/0x1550
f2fs_iomap_begin+0xb2/0x1e0
iomap_iter+0x1d6/0x430
__iomap_dio_rw+0x208/0x9a0
f2fs_file_write_iter+0x6b3/0xfa0
aio_write+0x15d/0x2e0
io_submit_one+0x55e/0xab0
__x64_sys_io_submit+0xa5/0x230
do_syscall_64+0x84/0x2f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0010:new_curseg+0x70f/0x720
The root cause of we run out-of-space is: in f2fs_map_blocks(), f2fs may
trigger foreground gc only if it allocates any physical block, it will be
a little bit later when there is multiple threads writing data w/
aio/dio/bufio method in parallel, since we always use OPU in lfs mode, so
f2fs_map_blocks() does block allocations aggressively.
In order to fix this issue, let's give a chance to trigger foreground
gc in prior to block allocation in f2fs_map_blocks().
Fixes: 36abef4e79 ("f2fs: introduce mode=lfs mount option")
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e194e140ab ]
In lfs mode, dirty data needs OPU, we'd better calculate lower_p and
upper_p w/ them during has_not_enough_free_secs(), otherwise we may
encounter out-of-space issue due to we missed to reclaim enough
free section w/ foreground gc.
Fixes: 36abef4e79 ("f2fs: introduce mode=lfs mount option")
Cc: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5661998536 ]
- touch /mnt/f2fs/012345678901234567890123456789012345678901234567890123
- truncate -s $((1024*1024*1024)) \
/mnt/f2fs/012345678901234567890123456789012345678901234567890123
- touch /mnt/f2fs/file
- truncate -s $((1024*1024*1024)) /mnt/f2fs/file
- mkfs.f2fs /mnt/f2fs/012345678901234567890123456789012345678901234567890123 \
-c /mnt/f2fs/file
- mount /mnt/f2fs/012345678901234567890123456789012345678901234567890123 \
/mnt/f2fs/loop
[16937.192225] F2FS-fs (loop0): Mount Device [ 0]: /mnt/f2fs/012345678901234567890123456789012345678901234567890123\xff\x01, 511, 0 - 3ffff
[16937.192268] F2FS-fs (loop0): Failed to find devices
If device path length equals to MAX_PATH_LEN, sbi->devs.path[] may
not end up w/ null character due to path array is fully filled, So
accidently, fields locate after path[] may be treated as part of
device path, result in parsing wrong device path.
struct f2fs_dev_info {
...
char path[MAX_PATH_LEN];
...
};
Let's add one byte space for sbi->devs.path[] to store null
character of device path string.
Fixes: 3c62be17d4 ("f2fs: support multiple devices")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>