
Kernel 6.9.12 index out of range #1577

Open
jvinolas opened this issue Oct 30, 2024 · 3 comments

Labels
bug Something isn't working

Comments

@jvinolas
Description

dmesg reports: index 2 is out of range for type 'ocf_map_info [*]', although the cache is mounted.

Expected Behavior

No stack trace when activating the existing CAS cache.

Actual Behavior

The cache is created after boot and seems to work, but we are concerned about the dmesg stack trace.

Steps to Reproduce

  1. Install Ubuntu 24.04 and upgrade the kernel to the mainline build from https://kernel.ubuntu.com/mainline/v6.9.12/amd64/
  2. git clone -b v24.9 https://github.com/Open-CAS/open-cas-linux.git
  3. Install the module and create a cache over a dm-vdo device
  4. casadm -S -i 1 -d /dev/disk/by-id/nvme-SAMSUNG_MZWLR3T8HBLS-00AU3_S64VNE0NC03689-part1 --cache-line-size 64 -c wb --force
  5. casadm -A -d /dev/disk/by-id/dm-name-vg_vdo0-vdo0 -i 1
  6. casadm --set-param --name cleaning-alru -i 1 --wake-up 0 --staleness-time 1 --flush-max-buffers 10000 --activity-threshold 0
  7. Set up start at boot in /etc/opencas/opencas.conf with the same ids and lazy_startup on the core (1 1 /dev/disk/by-id/dm-name-vg_vdo0-vdo0 lazy_startup=true); see the example configuration below
  8. Reboot; the stack trace appears in dmesg
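
For reference, a minimal /etc/opencas/opencas.conf matching these steps might look roughly like the sketch below. Only the [cores] line is quoted verbatim in step 7; the [caches] line is reconstructed from the casadm -S command above, and the exact extra-field names (e.g. cache_line_size) are an assumption based on the default config shipped with Open-CAS, so check the template installed on your system.

[caches]
# <cache id>   <path to cache device>                                                  <cache mode>   <extra fields (optional)>
1              /dev/disk/by-id/nvme-SAMSUNG_MZWLR3T8HBLS-00AU3_S64VNE0NC03689-part1    WB             cache_line_size=64

[cores]
# <cache id>   <core id>   <path to core device>                    <extra fields (optional)>
1              1           /dev/disk/by-id/dm-name-vg_vdo0-vdo0     lazy_startup=true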

Context

We are concerned that, even though the cache is mounted, this could lead to hangs or corruption when using it.

Possible Fix

Logs

[  156.574550] ------------[ cut here ]------------
[  156.574560] UBSAN: array-index-out-of-bounds in /root/install/open-cas-linux/modules/cas_cache/src/ocf/ocf_request.c:151:44
[  156.707912] index 2 is out of range for type 'ocf_map_info [*]'
[  156.778817] CPU: 58 PID: 1737 Comm: cas_cl_cache1 Tainted: G           OE      6.9.12-060912-generic #202408281052
[  156.778822] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66100200 03/29/2024
[  156.778825] Call Trace:
[  156.778829]  <TASK>
[  156.778834]  show_stack+0x49/0x60
[  156.778843]  dump_stack_lvl+0x5f/0x90
[  156.778854]  dump_stack+0x10/0x18
[  156.778858]  ubsan_epilogue+0x9/0x40
[  156.778862]  __ubsan_handle_out_of_bounds.cold+0x44/0x49
[  156.778866]  ocf_req_new_cleaner+0x21b/0x270 [cas_cache]
[  156.778926]  _ocf_cleaner_alloc_req+0x23/0x110 [cas_cache]
[  156.778953]  ocf_cleaner_fire+0x47/0x410 [cas_cache]
[  156.778976]  ? __pfx__ocf_cleaner_cmp+0x10/0x10 [cas_cache]
[  156.778997]  ocf_cleaner_do_flush_data_async+0x23/0x40 [cas_cache]
[  156.779017]  cleaning_alru_perform_cleaning+0x520/0x5e0 [cas_cache]
[  156.779047]  ? __pfx_cleaning_alru_perform_cleaning+0x10/0x10 [cas_cache]
[  156.779073]  ocf_cleaner_run+0x15f/0x200 [cas_cache]
[  156.779100]  ? _raw_spin_lock_irqsave+0xe/0x20
[  156.779106]  _cas_cleaner_thread+0x107/0x250 [cas_cache]
[  156.779128]  ? __pfx_autoremove_wake_function+0x10/0x10
[  156.779134]  ? __pfx__cas_cleaner_thread+0x10/0x10 [cas_cache]
[  156.779153]  kthread+0xe1/0x110
[  156.779156]  ? __pfx_kthread+0x10/0x10
[  156.779158]  ret_from_fork+0x44/0x70
[  156.779161]  ? __pfx_kthread+0x10/0x10
[  156.779163]  ret_from_fork_asm+0x1a/0x30
[  156.779166]  </TASK>
[  156.779167] ---[ end trace ]---

Configuration files

Your Environment

  • OpenCAS version (commit hash or tag): v24.9
  • Operating System: Ubuntu 24.04
  • Kernel version: 6.9.12-060912-generic #202408281052
  • Cache device type (NAND/Optane/other): nvme
  • Core device type (HDD/SSD/other): raid0 with dm-vdo over it
  • Cache configuration: casadm -S -i 1 -d /dev/disk/by-id/nvme-SAMSUNG_MZWLR3T8HBLS-00AU3_S64VNE0NC03689-part1 --cache-line-size 64 -c wb --force && casadm --set-param --name cleaning-alru -i 1 --wake-up 0 --staleness-time 1 --flush-max-buffers 10000 --activity-threshold 0
    • Cache mode: wb
    • Cache line size: 64
    • Promotion policy: (default: always)
    • Cleaning policy: (default: alru)
    • Sequential cutoff policy: (default: full)
  • Other (e.g. lsblk, casadm -P, casadm -L)
sdb                         8:16   0    8T  0 disk  
└─sdb1                      8:17   0    8T  0 part  
  └─md100                   9:100  0   32T  0 raid0 
    └─vg_vdo0-lv0_vdata   252:0    0   31T  0 lvm   
      └─vg_vdo0-lv0-vpool 252:1    0  500T  0 lvm   
        └─vg_vdo0-vdo0    252:2    0  500T  0 lvm   
          └─cas1-1        251:0    0  500T  0 disk  
sdc                         8:32   0    8T  0 disk  
└─sdc1                      8:33   0    8T  0 part  
  └─md100                   9:100  0   32T  0 raid0 
    └─vg_vdo0-lv0_vdata   252:0    0   31T  0 lvm   
      └─vg_vdo0-lv0-vpool 252:1    0  500T  0 lvm   
        └─vg_vdo0-vdo0    252:2    0  500T  0 lvm   
          └─cas1-1        251:0    0  500T  0 disk  
sdd                         8:48   0    8T  0 disk  
└─sdd1                      8:49   0    8T  0 part  
  └─md100                   9:100  0   32T  0 raid0 
    └─vg_vdo0-lv0_vdata   252:0    0   31T  0 lvm   
      └─vg_vdo0-lv0-vpool 252:1    0  500T  0 lvm   
        └─vg_vdo0-vdo0    252:2    0  500T  0 lvm   
          └─cas1-1        251:0    0  500T  0 disk  
sde                         8:64   0    8T  0 disk  
└─sde1                      8:65   0    8T  0 part  
  └─md100                   9:100  0   32T  0 raid0 
    └─vg_vdo0-lv0_vdata   252:0    0   31T  0 lvm   
      └─vg_vdo0-lv0-vpool 252:1    0  500T  0 lvm   
        └─vg_vdo0-vdo0    252:2    0  500T  0 lvm   
          └─cas1-1        251:0    0  500T  0 disk 
type    id   disk             status    write policy   device
cache   1    /dev/nvme0n1p1   Running   wb             -
└core   1    /dev/dm-2        Active    -              /dev/cas1-1
Cache Id                  1
Cache Size                609660960 [4KiB Blocks] / 2325.67 [GiB]
Cache Device              /dev/nvme0n1p1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           alru
Promotion Policy          always
Cache line size           64 [KiB]
Metadata Memory Footprint 2.8 [GiB]
Dirty for                 0 [s] / Cache clean
Status                    Running
@jvinolas jvinolas added the bug Something isn't working label Oct 30, 2024
@robertbaldyga
Member

Hi @jvinolas! Thank you for reporting. This is a false positive from UBSAN - nothing to worry about. We already have a workaround for it merged to the OCF master: Open-CAS/ocf#845 and Open-CAS/ocf#857.
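
For context, a minimal sketch of the kind of allocation pattern that commonly produces this UBSAN array-index-out-of-bounds report: a struct ending in an array whose real length is only known at allocation time. This is not OCF's actual code; the names request, map_info and request_new are hypothetical.

/* Minimal, self-contained sketch (hypothetical names, not OCF code). */
#include <stdlib.h>

struct map_info { int core_line; };

struct request {
    unsigned map_count;
    /* If this member were declared with a fixed bound (e.g. map[1]) or as a
     * VLA-typed member, UBSAN's bounds check only sees the declared size and
     * flags map[2] even though the object was over-allocated and the memory
     * is valid. A true flexible array member like this one avoids the report. */
    struct map_info map[];
};

static struct request *request_new(unsigned count)
{
    /* Allocate the header plus storage for 'count' trailing map entries. */
    struct request *req =
        malloc(sizeof(*req) + count * sizeof(struct map_info));
    if (req)
        req->map_count = count;
    return req;
}

int main(void)
{
    struct request *req = request_new(3);
    if (!req)
        return 1;
    req->map[2].core_line = 42;  /* valid: storage for 3 entries exists */
    free(req);
    return 0;
}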

@jvinolas
Author

I upgraded to the current master commit (aafc6b49a6c4e4912d4dc3db0fe212a42d8f7e0b), did the build and install, and the error came back after a server hang, once the cache was activated again with --load in the same environment:

casadm --version
╔═════════════════════════╤═════════════════════╗
║ Name                    │       Version       ║
╠═════════════════════════╪═════════════════════╣
║ CAS Cache Kernel Module │ 24.09.0.0912.master ║
║ CAS CLI Utility         │ 24.09.0.0912.master ║
╚═════════════════════════╧═════════════════════╝

[Tue Nov 12 13:38:30 2024] Inserting cache cache1
[Tue Nov 12 13:38:30 2024] cache1: Metadata initialized
[Tue Nov 12 13:38:30 2024] cache1: Successfully added
[Tue Nov 12 13:38:30 2024] cache1: Cache mode : wb
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_0 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_1 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_2 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_3 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_4 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_5 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_6 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_7 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_8 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_9 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_10 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_11 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_12 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_13 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_14 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_15 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_16 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_17 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_18 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_19 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_20 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_21 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_22 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_23 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_24 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_25 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_26 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_27 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_28 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_29 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_30 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_31 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_32 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_33 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_34 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_35 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_36 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_37 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_38 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_39 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_40 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_41 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_42 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_43 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_44 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_45 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_46 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_47 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_48 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_49 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_50 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_51 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_52 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_53 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_54 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_55 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_56 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_57 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_58 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_59 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_60 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_61 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_62 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_63 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_64 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_65 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_66 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_67 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_68 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_69 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_70 started
[Tue Nov 12 13:38:30 2024] Thread cas_io_1_71 started
[Tue Nov 12 13:38:30 2024] Thread cas_mngt_1 started
[Tue Nov 12 13:38:30 2024] cache1: Super block config offset : 0 kiB
[Tue Nov 12 13:38:30 2024] cache1: Super block config size : 2200 B
[Tue Nov 12 13:38:30 2024] cache1: Super block runtime offset : 128 kiB
[Tue Nov 12 13:38:30 2024] cache1: Super block runtime size : 4 B
[Tue Nov 12 13:38:30 2024] cache1: Reserved offset : 256 kiB
[Tue Nov 12 13:38:30 2024] cache1: Reserved size : 128 kiB
[Tue Nov 12 13:38:30 2024] cache1: Part config offset : 384 kiB
[Tue Nov 12 13:38:30 2024] cache1: Part config size : 48 kiB
[Tue Nov 12 13:38:30 2024] cache1: Part runtime offset : 640 kiB
[Tue Nov 12 13:38:30 2024] cache1: Part runtime size : 72 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core config offset : 768 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core config size : 512 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core runtime offset : 1792 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core runtime size : 1172 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core UUID offset : 3072 kiB
[Tue Nov 12 13:38:30 2024] cache1: Core UUID size : 16384 kiB
[Tue Nov 12 13:38:30 2024] cache1: Cleaning offset : 35840 kiB
[Tue Nov 12 13:38:30 2024] cache1: Cleaning size : 446968 kiB
[Tue Nov 12 13:38:30 2024] cache1: LRU list offset : 482816 kiB
[Tue Nov 12 13:38:30 2024] cache1: LRU list size : 334980 kiB
[Tue Nov 12 13:38:30 2024] cache1: Collision offset : 817920 kiB
[Tue Nov 12 13:38:30 2024] cache1: Collision size : 1571292 kiB
[Tue Nov 12 13:38:30 2024] cache1: List info offset : 2389248 kiB
[Tue Nov 12 13:38:30 2024] cache1: List info size : 334980 kiB
[Tue Nov 12 13:38:30 2024] cache1: Hash offset : 2724352 kiB
[Tue Nov 12 13:38:30 2024] cache1: Hash size : 37212 kiB
[Tue Nov 12 13:38:30 2024] cache1: Cache line size: 64 kiB
[Tue Nov 12 13:38:30 2024] cache1: Metadata size on device: 2761600 kiB
[Tue Nov 12 13:38:30 2024] cache1: Loading cache state...
[Tue Nov 12 13:38:30 2024] Thread cas_cl_cache1 started
[Tue Nov 12 13:38:30 2024] cache1: Policy 'always' initialized successfully
[Tue Nov 12 13:38:30 2024] cache1.core1: Seqential cutoff init
[Tue Nov 12 13:38:30 2024] cache1: Disk lines = 8388608000
[Tue Nov 12 13:38:30 2024] cache1: ERROR: Cache device did not shut down properly!
[Tue Nov 12 13:38:30 2024] cache1: Initiating recovery sequence...
[Tue Nov 12 13:38:31 2024] cache1: Done loading cache state
[Tue Nov 12 13:38:34 2024] cache1: Done saving cache state!
[Tue Nov 12 13:38:34 2024] cache1: Cache attached
[Tue Nov 12 13:38:34 2024] cache1: Successfully loaded
[Tue Nov 12 13:38:34 2024] cache1: Cache mode : wb
[Tue Nov 12 13:38:34 2024] cache1: Cleaning policy : alru
[Tue Nov 12 13:38:34 2024] cache1: Promotion policy : always
[Tue Nov 12 13:38:34 2024] cache1.core1: Successfully added
[Tue Nov 12 13:38:34 2024] [Open-CAS] Adding device /dev/disk/by-id/nvme-SAMSUNG_MZWLR3T8HBLS-00AU3_S64VNE0NC03689-part1 as cache cache1
[Tue Nov 12 13:38:34 2024] [Open-CAS] [Classifier] Initialized IO classifier
[Tue Nov 12 13:38:34 2024] [Open-CAS] Adding device /dev/disk/by-id/dm-name-vg_vdo0-vdo0 as core core1 to cache cache1
[Tue Nov 12 13:38:34 2024] cache1: Write-back flush thread wake-up time: 0
[Tue Nov 12 13:38:34 2024] cache1: Write-back flush thread staleness time: 1
[Tue Nov 12 13:38:34 2024] cache1: Write-back flush thread max buffers flushed per iteration: 10000
[Tue Nov 12 13:38:35 2024] cache1: Write-back flush thread activity time threshold: 0
[Tue Nov 12 13:38:35 2024] SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
[Tue Nov 12 13:38:35 2024] XFS (cas1-1): Mounting V5 Filesystem 92f465f2-e0e4-4b62-a621-2012f8f54894
[Tue Nov 12 13:38:35 2024] XFS (cas1-1): Starting recovery (logdev: internal)
[Tue Nov 12 13:38:35 2024] ------------[ cut here ]------------
[Tue Nov 12 13:38:35 2024] UBSAN: array-index-out-of-bounds in /root/install/open-cas-linux/modules/cas_cache/src/ocf/ocf_request.c:151:44
[Tue Nov 12 13:38:35 2024] index 1 is out of range for type 'ocf_map_info [*]'
[Tue Nov 12 13:38:35 2024] CPU: 25 PID: 3453 Comm: cas_io_1_25 Tainted: G           OE      6.9.12-060912-generic #202408281052
[Tue Nov 12 13:38:35 2024] Hardware name: Oracle Corporation ORACLE SERVER X9-2c/TLA,MB TRAY,X9-2c, BIOS 66100200 03/29/2024
[Tue Nov 12 13:38:35 2024] Call Trace:
[Tue Nov 12 13:38:35 2024]  <TASK>
[Tue Nov 12 13:38:35 2024]  show_stack+0x49/0x60
[Tue Nov 12 13:38:35 2024]  dump_stack_lvl+0x5f/0x90
[Tue Nov 12 13:38:35 2024]  dump_stack+0x10/0x18
[Tue Nov 12 13:38:35 2024]  ubsan_epilogue+0x9/0x40
[Tue Nov 12 13:38:35 2024]  __ubsan_handle_out_of_bounds.cold+0x44/0x49
[Tue Nov 12 13:38:35 2024]  ocf_req_new_cleaner+0x21b/0x270 [cas_cache]
[Tue Nov 12 13:38:35 2024]  _ocf_cleaner_alloc_req+0x23/0x110 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ocf_cleaner_fire+0x47/0x410 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ocf_engine_clean+0x5c/0x80 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ? __pfx__ocf_engine_clean_end+0x10/0x10 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ? __pfx__ocf_engine_clean_getter+0x10/0x10 [cas_cache]
[Tue Nov 12 13:38:35 2024]  _ocf_read_generic_do+0x28e/0x370 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ? ocf_engine_prepare_clines+0x444/0x6c0 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ocf_read_generic+0x4f/0xe0 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ocf_queue_run+0xf1/0x120 [cas_cache]
[Tue Nov 12 13:38:35 2024]  _cas_io_queue_thread+0x77/0x140 [cas_cache]
[Tue Nov 12 13:38:35 2024]  ? __pfx_autoremove_wake_function+0x10/0x10
[Tue Nov 12 13:38:35 2024]  ? __pfx__cas_io_queue_thread+0x10/0x10 [cas_cache]
[Tue Nov 12 13:38:35 2024]  kthread+0xe1/0x110
[Tue Nov 12 13:38:35 2024]  ? __pfx_kthread+0x10/0x10
[Tue Nov 12 13:38:35 2024]  ret_from_fork+0x44/0x70
[Tue Nov 12 13:38:35 2024]  ? __pfx_kthread+0x10/0x10
[Tue Nov 12 13:38:35 2024]  ret_from_fork_asm+0x1a/0x30
[Tue Nov 12 13:38:35 2024]  </TASK>
[Tue Nov 12 13:38:35 2024] ---[ end trace ]---
[Tue Nov 12 13:38:43 2024] workqueue: do_global_cleanup [dm_bufio] hogged CPU for >10000us 35 times, consider switching to WQ_UNBOUND
[Tue Nov 12 13:38:52 2024] XFS (cas1-1): Ending recovery (logdev: internal)

@jvinolas
Author

I've updated to the master commit as of now and I'm getting the same UBSAN error.

We are also now seeing some impact on the write-back cache when doing rsync to it, with low throughput (mostly hitting the core). It was working fine two days ago; nothing was changed until we decided to try upgrading the cache module, but the behaviour is still the same.
