IJ51031 |
High Importance
|
Metadata corruption on one node, with folders not correctly visible. Cannot cd into a directory on the affected node.
Symptom |
System that seems to have cached some bad data for a directory and cannot cd into the directory on the bad node |
Environment |
Linux Only |
Trigger |
Unknown |
Workaround |
None |
|
5.1.9.5 |
No adverse effect. This is a failsafe change
IJ51457 |
Critical |
If the File Audit Logging Audit Logs are compressed while GPFS is appending to them, the Audit Log can become corrupted and unrecoverable. This can happen when a compression policy is run against the audit log fileset / audit logs.
Symptom |
Operation failure due to FS corruption |
Environment |
Linux Only |
Trigger |
The problem could be triggered if the Audit Logs are compressed by something other than the File Audit Logging code. When File Audit Logging wraps logs, FAL compresses the Audit Logs after mmfsd is done appending to them. If a user or program attempts to compress logs that are currently being appended to, unrecoverable corruption can happen to that Audit Log. |
Workaround |
None |
|
5.1.9.5 |
File Audit Logging |
IJ49862 |
High Importance
|
When the daemon restarts on a worker node, a race condition can cause the worker's local state change to take place after GNR's readmit operation, which is intended to repair tracks with stale data. The delayed state change can cause the intended readmit operation to fail to repair the data on the given disks, leaving stale sectors in tracks that could have been fixed once the delayed state change took place. If more disks fail before the next cycle of scan and repair operations has a chance to repair these vtracks, data loss can result if the number of faults exceeds the fault tolerance of the vdisk.
Symptom |
Daemon crash |
Environment |
All |
Trigger |
Daemon restart on individual ECE node, or shared ESS node (even though much less likely), followed by more failing disks. |
Workaround |
Before the fix is installed, manually verify whether any vtracks are stuck in a stale state. |
|
5.1.9.5 |
GNR |
IJ51652 |
High Importance
|
Configuring perfmon --collectors with a non-cluster node name (e.g., a hostname that differs from the admin or daemon node name) will break mmsysmon node-role detection, bring the perfmon query port down, and cause the GUI node to raise the gui_pmcollector_connection_failed event.
Symptom |
Event gui_pmcollector_connection_failed on GUI node. |
Environment |
Linux Only |
Trigger |
Use of mmperfmon config option --collectors with an invalid cluster node identifier. |
Workaround |
None |
|
5.1.9.5 |
Performance Monitoring Tool, GUI, mmhealth thresholds, GrafanaBridge |
IJ51658 |
Critical |
Signal 11 hit in function AclDataFile::hashInsert in acl.C, due to a race condition when adding ACLs and handling cached ACL data invalidation during node recovery or upon hitting a disk error, resulting in an mmfsd daemon crash.
Symptom |
Abend/Crash |
Environment |
All Operating System environments |
Trigger |
Setting ACLs while node recovery or a disk error happens. Node recovery or a disk error invalidates the internal cached ACL data, so there is a small window in which setting ACLs can cause an unassigned memory access. |
Workaround |
Avoid setting ACLs on inodes |
|
5.1.9.5 |
All Scale Users |
IJ51363 |
High Importance
|
Scanning a directory with policy or the Scale gpfs_ireaddir64 API has been degraded since the 5.1.3 release.
Symptom |
Performance impact |
Environment |
All Operating Systems |
Trigger |
Run a policy job or use the Scale gpfs_ireaddir64 API to scan a directory in a Scale file system |
Workaround |
None |
|
5.1.9.5 |
policy or gpfs_ireaddir64 API |
IJ51704 |
High Importance
|
Triggering recovery on an IW fileset (by running ls -l on the root of the fileset) with the afmIOFlag afmRecoveryUseFset set on it causes a deadlock, which resolves itself after roughly 10 minutes (300 retries of queueing the Getattr for the ls command).
Symptom |
Deadlock |
Environment |
Linux Only |
Trigger |
Triggering Recovery on the fileset using ls -l. |
Workaround |
Don't trigger recovery on an IW fileset using "ls -l" on the fileset. Instead use the makeActive subcommand of mmafmctl (see the sketch after this entry). |
|
5.1.9.5 |
AFM |
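A minimal sketch of that workaround, with "fs1" and "iwFileset" as placeholder names; the exact makeActive syntax is an assumption based on the usual mmafmctl conventions, so verify it against the documentation for your release:

  mmafmctl fs1 makeActive -j iwFileset   # trigger recovery without an "ls -l" on the fileset root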
IJ51705 |
High Importance
|
1. Introduce a new config option, afmSkipPtrash, to skip moving files to the .ptrash directory.
2. Also add an mmafmctl subcommand, "emptyPtrash", to clean up the .ptrash directory without relying on rm -rf, similar to the --empty-ptrash flag of prefetch.
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Need for a separate command that can assist in deletion of .ptrash directory contents. |
Workaround |
Manually clean up the .ptrash directory by hand, or perform --empty-ptrash with the prefetch subcommand (see the sketch after this entry). |
|
5.1.9.5 |
AFM |
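A hedged sketch of how the two additions above might be used once the fix is installed; "fs1" and "fset1" are placeholders, and the exact flag syntax (setting the option via -p, selecting the fileset via -j) is an assumption to verify against the 5.1.9.5 documentation:

  mmchfileset fs1 fset1 -p afmSkipPtrash=yes   # skip moving deleted files into .ptrash
  mmafmctl fs1 emptyPtrash -j fset1            # clean up existing .ptrash contents without rm -rf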
IJ51706 |
High Importance
|
afmCheckRefreshDisable is today a cluster-level tunable that keeps a refresh from going to the filesystem itself and returns it from the dcache. But when tuned, it applies to all AFM filesets in the cluster. A fileset-level tunable is needed to do the same, so that it does not impact all other filesets in the cluster as it does today.
Symptom |
Unexpected Behavior |
Environment |
All OS Environments |
Trigger |
Enabling afmCheckRefreshDisable config at the cluster level. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51707 |
High Importance
|
For some threshold events, the system pushes the event from the Threshold component to the Filesystem component internally. Due to misaligned data, the event could get suppressed.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Some parts get sorted, and the issue hits if a fileset name sorts before the file system name. This is most likely with uppercase fileset names (assuming a lowercase filesystem name) |
Workaround |
Instead of using the default threshold rule, create one that only looks at the gpfs_fset_freeInodes absolute number to reliably raise events (see the sketch after this entry), or check inode usage over time in the GUI |
|
5.1.9.5 |
• System Health • perfmon (Zimon) |
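A hedged sketch of the workaround's custom rule, assuming the usual mmhealth thresholds syntax; the rule name, warn/error levels, and group keys are placeholders to adapt to the installation:

  mmhealth thresholds add gpfs_fset_freeInodes --warnlevel 10000 --errorlevel 1000 \
      --direction low --groupby gpfs_fs_name,gpfs_fset_name --name custom_fset_free_inodes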
IJ51708 |
High Importance
|
When the dynamic pagepool is enabled, pagepool memory shrinks only slowly while a memory-consuming application is requesting memory
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
Low system memory |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51709 |
High Importance
|
Pagepool growth is rejected due to recent pagepool change history
Symptom |
Performance Impact/Degradation |
Environment |
ALL Linux OS environments |
Trigger |
Daemon startup |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51710 |
High Importance
|
Memory allocation from the shared segment failed
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
pagepool growing and shrinking |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ51711 |
High Importance
|
If a symbolic-link mount is attempted on an existing symlink to a directory, it ends up creating a symbolic link with the same name as the source inside the target directory. Since the DR is mostly RO in nature, this gets an E_ROFS error and prints these failures to the log.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Remount of AFM DR target over the NSD backend target. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51712 |
High Importance
|
mmwmi.exe is a helper utility on Windows which is used by various mm* administration scripts to query system settings such as IP addresses, mounted filesystems, and so on. Under certain conditions, such as active realtime scanning by security endpoints and anti-malware software, the output of mmwmi is not sent to stdout and any connected pipes that depend on it. This can cause various GPFS configuration scripts, such as mmcrcluster, to fail.
Symptom |
Unexpected Results/Behavior. |
Environment |
Windows/x86_64 only. |
Trigger |
Execute GPFS administration commands (such as mmcrcluster) during active anti-virus and anti-malware realtime scanning. |
Workaround |
None. |
|
5.1.9.5 |
Admin Commands. |
IJ51713 |
High Importance
|
The problem here is that during conversion a wrong target is specified, with the protocol given as jttp instead of http. This leads parsePcacheTarget to find the target invalid, but the code later tries to persist a NULL to disk, where the assert goes off.
Symptom |
Crash |
Environment |
All OS Environments |
Trigger |
Converting a non-AFM fileset to AFM MU mode with wrong protocol in the target. |
Workaround |
Specify one of the correct supported protocols. |
|
5.1.9.5 |
AFM |
IJ51781 |
Suggested |
"mmperfmon delete" shows a usage string, referencing "usage: perfkeys delete [-h]" instead of the proper usage.
(show details)
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
"mmperfmon delete" should be used |
Workaround |
None |
|
5.1.9.5 |
perfmon (ZIMON) |
IJ51782 |
Suggested |
Customers are getting "SyntaxWarning: invalid escape sequence" errors when "mmperfmon" is used for custom scripting.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
"mmperfmon" should be used for custom scripting, |
Workaround |
None |
|
5.1.9.5 |
perfmon (ZIMON) |
IJ51783 |
High Importance
|
Recovery is not syncing old directories
Symptom |
Unexpected Results/ |
Environment |
Linux Only |
Trigger |
1. Create a fileset without creating the target bucket.
2. Create 2 directories; the fileset will be in unmounted state.
3. Stop and start the fileset.
4. Create another new directory |
Workaround |
Instead of the mkdir operation, run a touch operation; it will sync old directories in case of recovery (see the sketch after this entry). |
|
5.1.9.5 |
AFM |
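A minimal sketch of the workaround, with a placeholder path; touching any file inside the affected fileset stands in for the mkdir and lets recovery sync the old directories:

  touch /gpfs/fs1/cosFileset/.recovery_kick   # hypothetical path inside the AFM fileset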
IJ51784 |
High Importance
|
mmafmcosconfig fails to create an AFM COS fileset in a sudo-configured setup
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
On a sudo-configured setup, try to create a fileset using mmafmcosconfig; it fails with an error. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51785 |
High Importance
|
Not able to initialize a download when the fileset is in Dirty state
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Create an IW fileset and create one file under it; while the fileset is in Dirty state, try running a download. It gives an error. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51786 |
Critical |
The AFM fileset goes into NeedsResync state due to replication of a file whose parent directory is local.
Symptom |
Fileset in NeedsResync state. |
Environment |
Linux. |
Trigger |
Upload of files from AFM fileset where parent is local. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ51787 |
High Importance
|
When a large number of secure connections are created at the same time between the mmfsd daemon instances in a Scale cluster, some of the secure connections may fail as a result of timeouts, resulting in unstable cluster operations.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL |
Trigger |
Rebooting all nodes of a large Scale cluster at the same time. |
Workaround |
Stage the rebooting of nodes in large Scale clusters such that they don't reboot at the same time. |
|
5.1.9.5 |
GPFS Core |
IJ51843 |
High Importance
|
Kernel crashes with the following assert message:
GPFS logAssertFailed: vinfoP->viInUse.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
The problem can happen when closing a file opened via NFS. |
Workaround |
None |
|
5.1.9.5 |
NFS exports |
IJ51844 |
High Importance
|
Newer versions of the libmount1 package that are installed by default on SUSE 15.6 filter out the device name from GPFS mount options, due to which the mount fails.
Symptom |
Mount failure |
Environment |
Linux SLES15 SP6 |
Trigger |
System should have a libmount1 package installed that is > 2.37. |
Workaround |
Downgrade libmount1 package if possible. |
|
5.1.9.5 |
GPFS core |
IJ51845 |
Critical |
AFM Gateway node reboot due to an out-of-memory exception. There is a memory leak during the upload/reconcile operation in an MU mode fileset.
Symptom |
OOM exception on AFM Gateway node. |
Environment |
Linux. |
Trigger |
Upload of files from AFM MU mode fileset to COS. |
Workaround |
None |
|
5.1.9.5 |
AFM |
IJ50654 |
High Importance
|
mmshutdown caused a kernel crash while calling dentry_unlink_inode, with a backtrace like this:
...
#10 page_fault at ffffffff8d8012e4
#11 iput at ffffffff8cef25cc
#12 dentry_unlink_inode at ffffffff8ceed5d6
#13 __dentry_kill at ffffffff8ceedb6f
#14 dput at ffffffff8ceee480
#15 __fput at ffffffff8ced3bcd
#16 ____fput at ffffffff8ced3d7e
#17 task_work_run at ffffffff8ccbf41f
#18 do_exit at ffffffff8cc9f69e
Symptom |
Kernel crash during mmshutdown |
Environment |
All Linux OS environments |
Trigger |
Kernel crash with dentry_unlink_inode when running mmshutdown. For the normal open(), the kernel seems to call fops_get, which is a call to try_module_get. The fix: we need to call try_module_get when we install cleanupFD. This will hold the module in place until gpfs_f_cleanup (called when the last mmfsd process terminates and allows basic cleanup for the next daemon startup) has been called for the cleanupFD. |
Workaround |
None |
|
5.1.9.5 |
All Scale Users (Linux) |
IJ51332 |
High Importance
|
GPFS daemon could assert unexpectedly with EXP(REGP != _NULL) in file alloc.C. This could occur on client nodes where there is active block allocation activity.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Block allocation and deallocation activities on a client node |
Workaround |
None |
|
5.1.9.5 |
All Scale Users |
IJ50480 |
High Importance
|
Long ACL garbage collection runs on the filesystem manager can cause lock conflicts with nodes that need to retrieve ACLs during garbage collection. The conflicts resolve after garbage collection has finished.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
All Operating System environments |
Trigger |
- Set new and unique ACLs for inodes
- Delete inodes with ACLs
- After some time, the ACL GC is started to clean up unreferenced ACLs (ACLs that no inodes reference)
- During the ACL GC run, retrieve the ACLs from existing inodes in the filesystem |
Workaround |
- Avoid setting new and unique ACLs for the inodes, or
- Change the filesystem manager to another node to stop the current garbage collection run (see the sketch after this entry), or
- Wait for the ACL GC to finish, or
- Use mode bits instead of ACLs |
|
5.1.9.5 |
All Scale Users |
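A minimal sketch of the second workaround option, with "fs1" and "node2" as placeholders; mmlsmgr and mmchmgr are the standard commands for inspecting and moving the file system manager role:

  mmlsmgr fs1          # show which node currently manages fs1
  mmchmgr fs1 node2    # move the manager role, interrupting the current ACL GC run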
IJ51846 |
Suggested |
Due to a locale issue, a few callhome commands were gathering region-specific data, causing an error in the AoAtool while parsing this data
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Run the callhome commands on a non-English locale |
Workaround |
None |
|
5.1.9.5 |
Callhome |
IJ51864 |
High Importance
|
Crash of a node while mounting a filesystem or while starting the node.
Symptom |
Crash |
Environment |
Linux Only |
Trigger |
A failure or error condition hit during the parsing of an fstab entry. |
Workaround |
None |
|
5.1.9.5 |
Filesystem mount |
IJ51908 |
High Importance
|
When the dynamic pagepool is enabled, the pagepool may not shrink because a pagepool grow is still in progress, which results in out of memory
Symptom |
Abend/Crash |
Environment |
ALL Linux OS environments |
Trigger |
Pagepool growing and shrinking |
Workaround |
None |
|
5.1.9.5 |
Dynamic pagepool |
IJ51909 |
Suggested |
There are a few occasions where error code 809 may be used inside the CCR quorum management component. Although not user actionable, the product was changed to make some note of this in mmfs.log instead of, as had been the case, making it available only in GPFS trace. The intent is to improve RAS in certain situations.
Symptom |
N/a |
Environment |
All |
Trigger |
N/a |
Workaround |
Use "mmlsmgr" before running "mmchnode --nonquorum" to determine the current cluster manager. If the node to be changed is the current cluster manager, then use mmlsmgr to determine the new cluster manager and then, if necessary, run mmchmgr to make an explicit choice of cluster manager. |
|
5.1.9.5 |
CCR |
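A minimal sketch of the workaround sequence, with "node2" and "node3" as placeholders (node3 being the node to demote):

  mmlsmgr -c                        # identify the current cluster manager
  mmchmgr -c node2                  # if node3 is the manager, move the role first
  mmchnode --nonquorum -N node3     # then change node3's quorum designation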
IJ51011 |
Suggested |
Nessus vulnerability scanner found HSTS communication is not enforced on mmsysmon port 9980
Symptom |
Nessus vulnerability scan finding/record (medium severity) |
Environment |
Linux |
Trigger |
Nessus vulnerability scan |
Workaround |
None |
|
5.1.9.4 |
mmsysmon on GUI/pmcollector node |
IJ50232 |
High Importance
|
The automated node expel mechanism (see references to the mmhealthPendingRPCExpelThreshold configuration parameter) uses the internal mmsdrcli command to issue an expel request to a node in the home cluster. If the sdrNotifyAuthEnabled configuration parameter is set to false (not recommended), the command fails and the expel request fails with it, with a message like the following:
[W] The TLS handshake with node 192.168.132.151 failed with error 410 (client side). mmsdrcli: [err 144] Connection shut down
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
The problem is triggered by the following (all conditions required):
- The sdrNotifyAuthEnabled configuration parameter is set to false
- Automated expel is enabled via the mmhealthPendingRPCExpelThreshold configuration parameter
- A node becomes hung or otherwise unable to respond to a token revoke request |
Workaround |
Set the sdrNotifyAuthEnabled configuration parameter to 'true' and then restart (in a rolling fashion) all the nodes on both home and client clusters. Once that is done, the mmsdrcli command should no longer fail.
See also https://www.ibm.com/support/pages/node/6560094 |
|
5.1.9.4 |
System Health
(even though the fix is not in system health itself) |
IJ51036 |
Medium Importance |
mm{add|del}disk will fail, triggered by a signal 11.
Symptom |
failure of the command. |
Environment |
Linux Only |
Trigger |
Run mm{add|del}disk with multiple NSDs. |
Workaround |
The problem/symptom would typically occur when multiple NSDs are added/deleted with the command. By running the command with one NSD at a time, the problem/symptom can be avoided (see the sketch after this entry). |
|
5.1.9.4 |
disk configuration and region management |
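A minimal sketch of the workaround, assuming one stanza file per NSD (placeholder names); the same one-at-a-time pattern applies to mmdeldisk:

  for stanza in nsd1.stanza nsd2.stanza nsd3.stanza; do
      mmadddisk fs1 -F "$stanza"    # each stanza file describes a single NSD
  done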
IJ51037 |
Suggested |
mmkeyserv returns an error when used to delete a previously deleted tenant, instead of returning a success return code.
Symptom |
Failure to remove an already deleted tenant. |
Environment |
ALL Linux OS environments |
Trigger |
Remove a Scale tenant from the GKLM server prior to invoking the 'mmkeyserv tenant delete' command. |
Workaround |
mmkeyserv can be used with the --force option to remove the Scale definition of a deleted tenant. |
|
5.1.9.4 |
File System Core |
IJ51057 |
Medium Importance |
From a Windows client, in the MMC permissions tab on a share, the ACL listing was always showing as Everyone. If a subdirectory inside a subdirectory is deleted, traversal to the inner subdirectory in a snapshot taken before the deletion was showing errors.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
The MMC permissions problem can happen when viewing permissions from the MMC permissions tab for a share. Snapshot traversal would show problems from the client when a subdirectory within a subdirectory is deleted on the actual filesystem and the same is then accessed in the snapshot directory. |
Workaround |
None |
|
5.1.9.4 |
SMB |
IJ51148 |
High Importance
|
find, or download all, when run on a given path, sets the time for each of the individual entities with respect to COS and ends up blocking a subsequent revalidation from fetching actual changes to the object's metadata from the COS to the cache.
Symptom |
Unexpected Behaviour |
Environment |
All OS Environments |
Trigger |
Performing an ls on directory before trying lookup for the file. |
Workaround |
None |
|
5.1.9.4 |
AFM |
IJ51149 |
High Importance
|
Due to an issue with the way mmfsckx scans compressed files and internally stores information to detect inconsistent compressed groups, mmfsckx will report and/or repair false positive inconsistencies for compressed files. The mmfsckx output will report something like the following, for example:
!Inode 791488 snap 6 fset 6 "user file" indirect block 1 level 1 @4:13508288: disk address (ditto) in slot 0 replica 0 pointing to data block 226 code 2012 is invalid
Symptom |
False positive corrections by mmfsckx |
Environment |
ALL Operating System environments |
Trigger |
mmfsckx run on file system having compressed files |
Workaround |
Run offline mmfsck |
|
5.1.9.4 |
mmfsckx |
IJ51150 |
High Importance
|
mmfsckx captures block allocation and deallocation information from remote client nodes or non-participating nodes that mount the file system while mmfsckx is running, and once the file system gets unmounted from these nodes it stops capturing such information. Due to an issue, mmfsckx was stopping the capture before the unmount event had completely ended, which led to mmfsckx reporting and/or repairing false positive lost blocks, bad (incorrectly allocated) blocks, and duplicate blocks.
Symptom |
False positive corrections by mmfsckx |
Environment |
ALL Operating System environments |
Trigger |
File system is unmounted on remote client or non-participating nodes while mmfsckx is running |
Workaround |
Run offline mmfsck |
|
5.1.9.4 |
mmfsckx |
IJ51252 |
Suggested |
The prefetch command fails but returns error code 0
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Run prefetch on a non-existent file. |
Workaround |
None |
|
5.1.9.4 |
AFM |
IJ51225 |
High Importance
|
There is a build failure while executing the mmbuildgpl command. The failure is seen while compiling /usr/lpp/mmfs/src/gpl-linux/kx.c, due to no member named '__st_ino' in struct stat.
Please refrain from upgrading to the affected kernel and/or OpenShift levels until the fix is available.
Symptom |
Build failure while building kernel gpl modules. |
Environment |
Linux Only
OpenShift 4.13.42+, 4.14.14.25+, 4.15.13+ with IBM Storage Scale Container Native, Fusion with GDP |
Trigger |
The problem could be triggered by newer kernels containing Linux kernel commit 5ae2702d7c482edbf002499e23a2e22ac4047af1 |
Workaround |
None |
|
5.1.9.4 |
Build / Installation |
IJ51222 |
Suggested |
If a problem with an encryption server happens, just the RkmId is visible in the default "mmhealth node show" view. Furthermore, there are two monitoring mechanisms: one uses an index to convey whether the main or backup server is affected, and one directly uses the hostname or IP for that. Moreover, the usual way to resolve an event with "mmhealth event resolve" has been broken for that component.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Code change to have RkmId as the common ground for old and new monitoring methods. |
Workaround |
mmhealth node show encryption -v (or -Y) to see the server information. mmhealth node eventlog -Y would work as well. Resolving the event needs a "mmsysmoncontrol restart" |
|
5.1.9.4 |
System Health |
IJ51160 |
Suggested |
A daemon assert gets triggered when afmLookupMapSize is set to a higher value of 32. The supported range is only 0 to 30.
Symptom |
Abend/Crash |
Environment |
All OS environments |
Trigger |
Setting the value of afmLookupMapSize to the high value of 32. |
Workaround |
Set the value of afmLookupMapSize in the 0 to 30 range only. |
|
5.1.9.4 |
AFM |
IJ51282 |
High Importance
|
mmrestoreconfig also restores the fileset configuration of a file system. If the cluster version (minReleaseLevel) is below 5.1.3.0, the fileset restore will fail as it tries to restore the fileset permission inherit mode even if it is the default. The permission inherit mode was not available before Storage Scale version 5.1.3.0.
Symptom |
• Error output/message • Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
mmbackupconfig and mmrestoreconfig on Storage Scale Product version 5.1.3.0 or higher while the cluster minReleaseLevel is below 5.1.3.0. |
Workaround |
N/A |
|
5.1.9.4 |
Admin Commands SOBaR
|
IJ51283 |
Suggested |
The mmchnode and mmumount commands did not clean up temporary node files in /var/mmfs/tmp.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Run mmchnode and mmumount command. |
Workaround |
Manually remove leftover tmp files from mmchnode and mmumount |
|
5.1.9.4 |
Admin Commands |
IJ51286 |
High Importance
|
GPFS daemon could unexpectedly fail with signal 11 when mounting a file system if file system quiesce is triggered during the mount process.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
File system quiesce triggered via file system command while file system mount is in progress |
Workaround |
Avoid running commands that trigger file system quiesce while client nodes are in the process of mounting the file system |
|
5.1.9.4 |
All Scale Users |
IJ51265 |
Critical |
It is possible for the EA overflow block to be corrupted as a result of log recovery after a node failure. This can lead to the loss of some extended attributes that cannot be stored in the inode.
Symptom |
Operation failure due to FS corruption |
Environment |
ALL Operating System environments |
Trigger |
Node failure after repeated extended attribute operations which trigger creation and deletion of the overflow block |
Workaround |
None |
|
5.1.9.4 |
All Scale Users |
IJ51344 |
Critical |
When writing to a memory-mapped file, there is a chance that incorrect data could be written to the file before and after the targeted write range.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Operating System environments |
Trigger |
Writing to memory-mapped files with offsets and lengths unaligned to the internal buffer range size (usually subblock size or 4k) could cause incorrect data to be written before and after the targeted write range |
Workaround |
Stop using memory-mapping |
|
5.1.9.4 |
All scale users |
IJ49992 |
Suggested |
If the local cluster nistCompliance value is off, the mmremotecluster and mmauth commands fail with an unclear error message.
Symptom |
Error output/message |
Environment |
ALL Operating System environments |
Trigger |
Running the mmauth and mmremotecluster commands to add or update a remote cluster while the local cluster is not NIST compliant. |
Workaround |
Fix the error and reissue the command. |
|
5.1.9.3 |
All Scale Users |
IJ50066 |
Suggested |
Creating an AFM LU mode fileset from a filesystem to a target in the same filesystem (a snapshot), using the NSD backend, fails with error 1. This happens because another code fix had an unintended consequence for this code path.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Creating a LU mode fileset to a target (snapshot) in the same filesystem. |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50067 |
High Importance
|
When afmResyncVer2 is run with afmSkipResyncRecovery set to yes, the priority directories that AFM usually queues should not be queued, since such directories might exist under parents that are not yet in sync, leading to error 112.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Run afmResyncVer2 on a fileset which has never been synced to home, with afmSkipResyncRecovery set on it. |
Workaround |
Fall back to Resync Version 1, where afmSkipResyncRecovery has effect. |
|
5.1.9.3 |
AFM |
IJ50068 |
High Importance
|
This APAR addresses two issues related to NFS-Ganesha that can cause crashes. Here are the details:
(gdb) bt
#0 0x00007fff88239a68 in raise ()
#1 0x00007fff8881ffb8 in crash_handler (signo=11, info=0x7ffb42abbe48, ctx=0x7ffb42abb0d0)
#3 0x00007fff888da5f4 in atomic_add_int64_t (augend=0x148, addend=1)
#4 0x00007fff888da658 in atomic_inc_int64_t (var=0x148)
#5 0x00007fff888de44c in _get_gsh_export_ref (a_export=0x0)
#6 0x00007fff8888c6c0 in release_lock_owner (owner=0x7ffef94a1cc0)
#7 0x00007fff88923e9c in nfs4_op_release_lockowner (op=0x7ffef922be60, data=0x7ffef954d290, resp=0x7ffef8629c30)
#8 0x00007fff888fb810 in process_one_op (data=0x7ffef954d290, status=0x7ffb42abcdf4)
#9 0x00007fff888fcc9c in nfs4_Compound (arg=0x7ffef95eec38, req=0x7ffef95ee410, res=0x7ffef8ce4b40)
#10 0x00007fff88819130 in nfs_rpc_process_request (reqdata=0x7ffef95ee410, retry=false)
#11 0x00007fff88819864 in nfs_rpc_valid_NFS (req=0x7ffef95ee410)
#12 0x00007fff88750618 in svc_vc_decode (req=0x7ffef95ee410)
#13 0x00007fff8874a8f4 in svc_request (xprt=0x7fff30039ca0, xdrs=0x7ffef95eb400)
#14 0x00007fff887504ac in svc_vc_recv (xprt=0x7fff30039ca0)
#15 0x00007fff8874a82c in svc_rqst_xprt_task_recv (wpe=0x7fff30039ed8)
#16 0x00007fff8874b858 in svc_rqst_epoll_loop (wpe=0x10041cc5cb0)
#17 0x00007fff8875b22c in work_pool_thread (arg=0x7ffdcd1047d0)
#18 0x00007fff88229678 in start_thread ()
#19 0x00007fff880d8938 in clone ()
Or
(gdb) bt
#0 0x00007f96f58d9b8f in raise ()
#1 0x00007f96f75c6633 in crash_handler (signo=11, info=0x7f96ad9fc9b0, ctx=0x7f96ad9fc880) a
#3 dec_nfs4_state_ref (state=0x7f9640465440)
#4 0x00007f96f76762f9 in dec_state_t_ref (state=0x7f9640465440)
#5 0x00007f96f767640c in nfs4_op_free_stateid (op=0x7f8dec12fba0, data=0x7f8dec1992b0, resp=0x7f8dec04ce70)
#6 0x00007f96f766dbae in process_one_op (data=0x7f8dec1992b0, status=0x7f96ad9fe128)
#7 0x00007f96f766ee80 in nfs4_Compound (arg=0x7f8dec110ab8, req=0x7f8dec110290, res=0x7f8dec5b7db0)
#8 0x00007f96f75c17db in nfs_rpc_process_request (reqdata=0x7f8dec110290, retry=false)
#9 0x00007f96f75c1cf1 in nfs_rpc_valid_NFS (req=0x7f8dec110290)
#10 0x00007f96f733edfd in svc_vc_decode (req=0x7f8dec110290)
#11 0x00007f96f733ac61 in svc_request (xprt=0x7f95d00c4a60, xdrs=0x7f8dec18dd00)
#12 0x00007f96f733ed06 in svc_vc_recv (xprt=0x7f95d00c4a60)
#13 0x00007f96f733abe1 in svc_rqst_xprt_task_recv (wpe=0x7f95d00c4c98)
#14 0x00007f96f73462f6 in work_pool_thread (arg=0x7f8ddc0cc2f0)
#15 0x00007f96f58cf1ca in start_thread ()
#16 0x00007f96f5119e73 in clone ()
Symptom |
Abend/Crash |
Environment |
Linux Only |
Trigger |
The crash occurs when the NFSv4 client attempts to access and delete a file simultaneously through different processes or threads, potentially leading to timing issues. |
Workaround |
None |
|
5.1.9.3 |
NFS-Ganesha crash followed by CES-IP failover. |
IJ49856 |
Critical |
Multi-threaded applications that issue mmap I/O and I/O system calls concurrently can hit a deadlock on the buffer lock. This is likely not a common pattern, but this problem has been observed with database applications.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Multiple application perform read/write to the same file at the same time. |
Workaround |
Avoid concurrently read/write to same file from multiple process. |
|
5.1.9.3 |
All Scale Users |
IJ50208 |
High Importance
|
Multi-threaded applications that issue mmap I/O and I/O system calls concurrently can hit a deadlock on the buffer lock. This is likely not a common pattern, but this problem has been observed with database applications.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Linux OS environments |
Trigger |
This problem is a race between three threads within the same process:
1) One thread accessing data in a mmap'ed GPFS file.
2) A second thread issuing any system calls that modifies the memory layout of the process (e.g. mmap, munmap, ...)
3) A third thread issuing an I/O system call (read or write) that accesses the same file and the same offset as thread 1, where accessing the userspace buffer also hits a page fault. |
Workaround |
Since this problem is specific to a newer codepath, this codepath can be disabled through a hidden config setting: mmchconfig mmapOptimizations=0 (see the sketch after this entry) |
|
5.1.9.3 |
All Scale Users |
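A minimal sketch of the workaround; mmapOptimizations is the hidden setting named above, and -i makes the change take effect immediately and persist across restarts:

  mmchconfig mmapOptimizations=0 -i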
IJ50209 |
Suggested |
Setting a security header as suggested by RFC 6797
Symptom |
Unexpected Results/Behavior [not really, unless one really looks at the returned header fields of the HTTP response - body data is not affected] |
Environment |
ALL Linux OS environments |
Trigger |
Running Scale 5.1.2 or later |
Workaround |
None |
|
5.1.9.3 |
perfmon (Zimon) |
IJ50210 |
High Importance
|
With File Audit Logging (FAL) enabled, when a change to the policy file happens and the LWE garbage collector runs for FAL, there is a small window in which a deadlock can occur, with the long waiter message 'waiting for shared ThSXLock' seen for the PolicyCmdThread.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
- Enable FAL
- Generate events in the file system
- Make a change to the policy file |
Workaround |
- Restart GPFS, and/or
- Disable FAL |
|
5.1.9.3 |
File Audit Logging |
IJ50211 |
High Importance
|
During a mount operation of the file system, updating LWE configuration information for File Audit Logging before the fileset metadata file (FMF) is initialized results in signal 11: NotGlobalMutexClass::acquire() + 0x10 at mastSMsg.C:44
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
- Enable FAL
- Mount the file system
- The update to the LWE config information happens before the FMF is initialized during the mount operation |
Workaround |
Disable FAL |
|
5.1.9.3 |
File Audit Logging |
IJ50320 |
Critical |
The AFM fileset goes into NeedsResync state due to replication of a file whose parent directory is local.
Symptom |
Fileset in NeedsResync state. |
Environment |
Linux |
Trigger |
Upload of files from AFM fileset where parent is local |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50321 |
High Importance
|
When a thread is flushing the file metadata of the ACL file to disk, there's a small window that a deadlock can occur when a different thread tries to get a Windows security descriptor, as getting the security descriptor requires reading the ACL file.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Windows |
Trigger |
- Make changes to Windows Security Descriptors/ACLs and ensure the changes go to the disk
- Retrieve the Windows Security Descriptor of inodes |
Workaround |
- Avoid setting and getting Windows Security Descriptor at the same time
- Restart GPFS |
|
5.1.9.3 |
All Scale Users |
IJ50323 |
High Importance
|
When checking the block alloc map, mmfsckx excludes the regions that are being checked or are already checked from further getting updated in the internal shadow map. But when checking for such excluded regions, it was not checking which poolId the region belonged to. This resulted in mmfsckx not updating the shadow map for a region belonging to one pool while checking the block alloc map for the same region belonging to a different pool. This led to mmfsckx falsely marking blocks as lost blocks, and then later to this assert.
Symptom |
Node assert |
Environment |
ALL Operating System environments |
Trigger |
Run mmfsckx with --repair |
Workaround |
Run offline mmfsck to fix corruptions |
|
5.1.9.3 |
mmfsckx |
IJ50035 |
High Importance
|
When RDMA verbsSend is enabled and the number of RDMA connections is larger than 16, a reconnect can cause a segmentation fault.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
RDMA verbsSend and TCP reconnect |
Workaround |
None |
|
5.1.9.3 |
All Scale Users |
IJ50372 |
Critical |
O_TRUNC is not ignored correctly after a successful file lookup during atomic_open(), so truncation can happen during the open routine, before permission checks happen. This leads to a scenario in which a user on a different node can truncate a file for which they do not have permission.
Symptom |
Unexpected Results/Behavior |
Environment |
All Linux OS environments |
Trigger |
- A file is created under a user with no write permissions for group and others (e.g mode 644) in one node
- A user on a different node atomic opens the file with O_TRUNC flag and tries to write to it |
Workaround |
- Avoid using O_TRUNC with atomic_open() |
|
5.1.9.3 |
All Scale Users |
IJ50373 |
Suggested |
For certain performance monitoring operations, in the case of an error the query and response get logged. That response can be large, and logging it regularly will cause mmsysmon.log to grow rapidly.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
Incomplete delete of performance metrics. Some checks are done on so called “measurements” which take several metrics and calculate a composite result. If only a subset of the metrics for the calculation is available, the error is triggered. |
Workaround |
Change the log level or alter the log line in the Python code. An alternative is to eliminate the trigger, either by re-doing "mmperfmon delete --expiredKeys" or by removing all collected performance data in /opt/IBM/zimon/data/ (see the sketch after this entry) |
|
5.1.9.3 |
System Health / perfmon (Zimon) |
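A hedged sketch of the two alternatives named in the workaround; stopping the collector before clearing its data directory is an assumption about a safe ordering, and the pmcollector service name should be verified for the installation:

  mmperfmon delete --expiredKeys   # re-do the expired-key cleanup
  # or remove all collected performance data on the collector node:
  systemctl stop pmcollector
  rm -rf /opt/IBM/zimon/data/*
  systemctl start pmcollector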
IJ50374 |
High Importance
|
With File Audit Logging (FAL) enabled, when deciding to run the LWE garbage collector for FAL, an attempt to try-acquire the lock on the policy file mutex is performed. If the policy file mutex is busy, the attempt is canceled and retried next time. Upon canceling, the policy file mutex can be released without being held, leading to the log assert.
Symptom |
Abend/Crash |
Environment |
All Linux OS environments |
Trigger |
- FAL is enabled
- Generate events in the file system
- Make a change to the policy file
- Listing policy partitions |
Workaround |
- Disable FAL if it is enabled |
|
5.1.9.3 |
File Audit Logging |
IJ50375 |
Critical |
GPFS daemon could assert unexpectedly with: Assert exp(0) in direct.C. This could happen on the file system manager node after a node failure.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Node failure after repeatedly create/delete same file in a directory |
Workaround |
None |
|
5.1.9.3 |
All Scale Users |
IJ50439 |
High Importance
|
The ts commands do not always return the correct error code, providing incorrect results to the mm commands that call them and resulting in incorrect cluster operations.
Symptom |
Failed cluster operations |
Environment |
All |
Trigger |
Incorrect cluster information or file system state |
Workaround |
None |
|
5.1.9.3 |
Core |
IJ50440 |
High Importance
|
mmfsckx fails to detect a file having an illReplicated extended attribute overflow block, and in repair mode will not mark the illReplicated flag on it.
Symptom |
Unexpected Results/Behavior |
Environment |
All supported |
Trigger |
mmfsckx run on a file system having a file with an illReplicated extended attribute overflow block |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50441 |
High Importance
|
When scanning a compressed file, mmfsckx in some cases can incorrectly report the file as having a bad disk address
Symptom |
Unexpected Results/Behavior |
Environment |
All supported |
Trigger |
mmfsckx run on a file system having sparsely compressed files |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50442 |
High Importance
|
When scanning a file system having a corrupted snapshot, mmfsckx can cause a node assert with logAssertFailed: countCRAs() == 0 && "likely a leftover cached inode in inode0 d'tor"
Symptom |
Node assert |
Environment |
All supported |
Trigger |
When scanning a file system having a corrupted snapshot |
Workaround |
Run offline mmfsck to fix corruption |
|
5.1.9.3 |
mmfsckx |
IJ50443 |
High Importance
|
AFM policy-generated intermediate files are always put in the /var filesystem: /var/mmfs/tmp for Resync/Failover and /var/mmfs/afm for Recovery. We have seen in customer setups that /var is provisioned very small, while there might be other filesystems that are well provisioned to handle such large files, such as the /opt that IBM defaults to, or maybe even inside the fileset itself.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Run AFM recovery or resync with a fileset that is very large in terms of inode space, like 100M inodes or more in it. |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ50463 |
High Importance
|
Stale data may be read while "mmchdisk start" is running.
Symptom |
Either no symptom or an fsstruct assert |
Environment |
all |
Trigger |
Disks are marked down and data on the disks become stale before "mmchdisk start" is run. |
Workaround |
Stop all workload before running "mmchdisk start". |
|
5.1.9.3 |
All Scale Users |
IJ50563 |
Critical |
In a file system with replication configured, for a large file with more than 5000 data blocks, if there are miss-updates on some data blocks due to disk failures on one replica disk, then these stale replicas would not be repaired if helper nodes get involved to repair them.
Symptom |
replica mismatch |
Environment |
All Operating Systems |
Trigger |
I/O errors on a disk caused it to be marked as "down", and some further write failures happen on a large file with more than 5000 data blocks; the down disk is then started with multiple participant nodes. |
Workaround |
Only specify the file system manager node as the participant node for the mmchdisk command (see the sketch after this entry). |
|
5.1.9.3 |
Scale Users |
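A minimal sketch of the workaround, with "fs1" as a placeholder; -N restricts the nodes participating in the repair:

  mmlsmgr fs1                          # find the file system manager node
  mmchdisk fs1 start -a -N fsMgrNode   # start the down disks using only that node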
IJ50577 |
High Importance
|
When there is a TCP network error, we try to reconnect the TCP connection, but the reconnect fails with a "Connection timed out" error, which results in a node expel.
Symptom |
Node expel/Lost Membership |
Environment |
ALL Operating System environments |
Trigger |
A degraded network which leads to TCP connection reconnect |
Workaround |
None |
|
5.1.9.3 |
All Scale Users |
IJ50708 |
Critical |
In a file system with replication configured, the miss-update info set in the disk address could be overwritten by the log recovery process, leading to stale data being read; the start-disk process also cannot repair such stale replicas.
Symptom |
replica mismatch |
Environment |
All Operating Systems |
Trigger |
I/O errors happening and generating miss-update info into the disk address of data blocks, and then a mmfsd daemon crash could result in such problem. |
Workaround |
None |
|
5.1.9.3 |
All Operating Systems |
IJ50794 |
High Importance
|
Symbolic links may be incorrectly deleted during offline mmfsck, which may cause undetected data loss
Symptom |
Offline mmfsck detects non-critical corruption (corrupt indirection level) and may delete the file if directed to.
mmfsck fsName -v -n
...
Error in inode 18177 snap 0: has corrupt indirection level 0
Delete inode? no |
Environment |
ALL Operating System environments |
Trigger |
When symbolic links are created in the filesystem with a current format version less than or equal to 3.5 and also with IBM Storage Scale V5.1.9.2, they may be incorrectly stored in the inode, even though the filesystem format does not support storing the symbolic links in the inode. This causes offline mmfsck to delete the incorrectly stored symbolic links. |
Workaround |
None |
|
5.1.9.3 |
General file system, creation of symbolic links. |
IJ50890 |
Suggested |
Metadata evict was giving an error from the second attempt onwards.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Running metadata eviction multiple times will trigger the issue |
Workaround |
None |
|
5.1.9.3 |
AFM |
IJ49762 |
High Importance
|
mmlsquota -d can cause the GPFS daemon to crash
Symptom |
GPFS daemon can crash when displaying default quota (mmlsquota -d) if default quota is not on. |
Environment |
ALL Operating System environments |
Trigger |
The assertion happens because the root quota entry has entry type e (explicit) instead of the default state. The root quota entry type could have been changed if we edit the quota entry (the entry type is changed to EXPLICIT_ENTRY) via the mmsetquota, mmedquota, or mmdefedquota commands. When displaying default quota limits (mmlsquota -d), if default quota is on, the entry type will revert to "default on", which would not cause the assertion. If default quota is off, the entryType remains e, hitting the assertion when displaying default quota limits. Fix: correct the mmlsquota -d processing so that the default quota status stored in root quota entries is updated to the expected values, based on quota options in sgDesc, avoiding the assertion. |
Workaround |
Enable default quota (all types: user, group, fileset) on the filesystem and then run mmlsquota -d (see the sketch after this entry). |
|
5.1.9.3 |
Quotas |
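A hedged sketch of the workaround, with "fs1" as a placeholder; the flag combinations for mmdefquotaon and mmlsquota are assumptions to verify against the documentation for your release:

  mmdefquotaon -u -g -j fs1    # assumed flags: enable default quotas for users, groups, and filesets
  mmlsquota -d -u fs1          # displaying default limits then no longer hits the assertion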
IJ49856 |
Critical |
An unexpected long waiter could appear with a fetch thread waiting on FetchFlowControlCondvar with reason 'wait for buffer for fetch'. This could happen when a workload causes all prefetch/writebehind threads to be assigned to prefetching.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
ALL Operating System environments |
Trigger |
Multiple application perform read/write to the same file at the same time. |
Workaround |
Avoid concurrently read/write to same file from multiple process. |
|
5.1.9.3 |
All Scale Users |
IJ50061 |
High Importance
|
When mmfsckx is run on a file system such that it requires multiple scan passes to complete, mmfsckx can abort with reason "Assert failed "nEnqueuedNodes > 1"."
Symptom |
Command aborts |
Environment |
ALL Operating System environments |
Trigger |
When mmfsckx is run on a file system such that it requires multiple inode scan passes to complete |
Workaround |
Increase the pagepool to make sure mmfsckx can run in a single scan pass. |
|
5.1.9.3 |
mmfsckx |
IJ49583 |
Suggested |
When a RDMA connection to a remote node has to be shut down due to network errors (e.g., the network link goes down), it can sometimes happen that the affected RDMA connection will not be closed and all resources assigned to this RDMA connection (memory, VERBS Queue Pair, ...) are not freed.
Symptom |
Unexpected Results/Behavior |
Environment |
ALL Linux OS environments |
Trigger |
verbsRdmaSend must be enabled. Loss of a RDMA connection to a node because of network errors in the RDMA fabric. |
Workaround |
No workaround available |
|
5.1.9.2 |
RDMA |
IJ49584 |
High Importance
|
Spectrum Scale Erasure Code Edition interacts with third-party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and starts to hang commands in the kernel, the hang may also block communication handling threads.
This causes a node to fail to renew its lease, causing it to be fenced off from the rest of the cluster. This may lead to additional outages. A previous APAR was issued for this in 5.1.4, but that fix was incomplete.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux Only |
Trigger |
Degradation in back-end storage management that causes commands to hang in the kernel |
Workaround |
The node with hardware problems will show waiters 'Until NSPDServer discovery completes'. It is recommended to reboot nodes with those GPFS waiters exceeding 2 minutes if the node is also being expelled. |
|
5.1.9.2 |
ESS/GNR |
IJ49585 |
Suggested |
If a tiebreaker disk has outdated version info, ccrrestore can abort with Python3 errors
Symptom |
CCR files will not get restored. |
Environment |
ALL Operating System environments |
Trigger |
Running "mmsdrrestore ---ccr-repair” on a node that’s upgraded to a new Spectrum Storage release while a tiebreaker disk still has state data from a previous release. |
Workaround |
None |
|
5.1.9.2 |
CCR |
IJ49659 |
Critical |
AFM sets pcache attributes on the inode after reading an uncached file from home. It modifies the inode while the filesystem is quiesced; the assert is hit due to this.
Symptom |
Assert |
Environment |
All OS environments |
Trigger |
Reading uncached file in AFM while filesystem is quiesced |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49586 |
High Importance
|
File systems that have a large number of independent filesets usually tend to have a sparse inode space.
If mmfsckx is run on such a file system having a large sparse inode space, it takes longer to run because it unnecessarily parses inode alloc map segments pointing to sparse inode spaces instead of skipping them.
Symptom |
Slowness to complete run |
Environment |
All |
Trigger |
Run mmfsckx on file systems having large number of independent filesets |
Workaround |
None |
|
5.1.9.2 |
FSCKX |
IJ49587 |
High Importance
|
When building an NFSv4 ACL from the POSIX access and default ACLs of a directory, if an update or store of an ACL to another file or directory happens in between the retrievals of the access ACL and the default ACL, a deadlock can occur, and the long waiter message "waiting for exclusive NF ThSXLock for readers to finish" is seen.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger |
- Have directories with POSIX access and default ACL
- Retrieve the NFSv4 ACL of the directories
- At the same time, store or update the ACLs of other files/directories
- If the store/update occurs in between the retrieval of the access ACL and the default ACL during the process of building the NFSv4 ACL, the deadlock will be hit. |
Workaround |
- If NFSv4 ACLs are needed, use NFSv4 ACL as the native ACL format instead of POSIX ACL (see the sketch after this entry), or
- Avoid retrieving ACLs of directories as NFSv4 ACLs when their native version are POSIX, or
- Use mode bits instead of ACLs. |
|
5.1.9.2 |
All Scale Users |
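A minimal sketch of the first workaround option, with "fs1" as a placeholder; note that mmchfs -k changes the ACL semantics for the whole file system, so evaluate the impact first:

  mmlsfs fs1 -k        # show the current ACL setting (posix, nfs4, or all)
  mmchfs fs1 -k nfs4   # make NFSv4 the native ACL format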
IJ49660 |
High Importance
|
When replicating over NFS with KRB plus AD: if a user who is not included in the AD at the primary site creates a file, the file is replicated as root to the DR first, and then a Setattr is attempted with the user/group to which the file/directory belongs.
If the user doesn't exist in AD and is local to the primary cluster alone, NFS prevents the Setattr, and therefore the whole create operation from primary to DR gets stuck with E_INVAL.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
Trying AFM DR replication with NFS + KRB + AD, with a local user who is not present in the AD, leading to NFS rejecting the user-ID-related operations at the DR and the queue becoming stuck. |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49661 |
Suggested |
Cluster health shows "healthy" for disabled CES services
Symptom |
Error output/message |
Environment |
ALL OS environments |
Trigger |
The transfer of events from nodes to the cluster manager was not working correctly if some field was empty. |
Workaround |
None |
|
5.1.9.2 |
System Health |
IJ49662 |
Suggested |
In certain cases the network status was not accounted for correctly, which could result in "stuck" events like cluster_connections_bad and cluster_connections_down.
Symptom |
Error output/message |
Environment |
ALL Linux OS environments |
Trigger |
A network channel between nodes was used and failed, and is then not needed anymore. |
Workaround |
As a stop-gap solution the events could be ignored, that would however also mute valid events. |
|
5.1.9.2 |
System Health |
IJ49710 |
Suggested |
For a failed callhome upload, remove the job from the queue if the DC package is not available.
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Remove/rename the DC package file before the callhome upload retry is scheduled |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49699 |
Suggested |
Sometimes the callhome upload fails due to a curl (52) error
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Upload a large/medium-size file using callhome sendfile |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49700 |
Suggested |
Sometimes an exception appears in the logs while the callhome sendfile progress is converted to an integer
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Upload a file using callhome sendfile |
Workaround |
None |
|
5.1.9.2 |
Callhome |
IJ49701 |
High Importance
|
Processes hang due to deadlocks in our Storage Scale cluster. There are deadlock notifications on multiple nodes which were triggered by 'long waiter' events on the nodes
Symptom |
Client processes hang and system deadlocks |
Environment |
Linux Only |
Trigger |
A single large file being read sequentially from one node (causing a readahead to be performed on the file or by using a posix_fadvise call to trigger readahead forcefully) and also being truncated/deleted from another node at the same time. |
Workaround |
None |
|
5.1.9.2 |
Regular file read flow in kernel version >= 5.14 |
IJ49714 |
Suggested |
Creating an AFM fileset with more than 32 afmNumFlushThreads gives an error
Symptom |
Error output/message |
Environment |
ALL Operating System environment |
Trigger |
Creating an AFM fileset with more than 32 afmNumFlushThreads gives an error |
Workaround |
Create the fileset with mmcrfileset and afmNumFlushThreads < 32, and later change this value using mmchfileset (see the sketch after this entry). |
|
5.1.9.2 |
AFM |
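A hedged sketch of the workaround, with placeholder names and target; AFM attributes are set via -p, and changing afmNumFlushThreads afterwards may require the fileset to be stopped or unlinked depending on the release:

  mmcrfileset fs1 fset1 --inode-space new \
      -p afmMode=iw -p afmTarget=nfs://homeServer/export \
      -p afmNumFlushThreads=30                     # stay below the 32 limit at create time
  mmchfileset fs1 fset1 -p afmNumFlushThreads=64   # raise the value afterwards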
IJ49715 |
Suggested |
The 'rpc.statd' may be terminated or experience a crash due to statd-related issues. In these instances, the NFSv3 client will relinquish control over NFSv3 exports, and the GPFS health monitor will indicate 'statd_down'.
Symptom |
The GPFS health monitor will show 'statd_down' warning and NFSv3 client lose control over NFSv3 exports. |
Environment |
Linux Only |
Trigger |
'rpc.statd' crashes or is stopped by an external process. |
Workaround |
None |
|
5.1.9.2 |
NFS |
IJ49580 |
High Importance
|
When the device file for an NSD disk got offline or unattached from a node, I/O issued from that node would fail with a "No such device or address" error (6), even though there are other NSD servers defined and available for servicing the I/O request.
Symptom |
I/O error |
Environment |
All Operating Systems |
Trigger |
The disk device got offline or unattached from a node. |
Workaround |
Reboot the node |
|
5.1.9.2 |
All Scale Users |
IJ49770 |
High Importance
|
An AFM object fileset fails to pull new objects from the S3/Azure store when the object fileset is exported via nfs-ganesha and a readdir is performed over the NFS mount. Performing the readdir on the fileset directly pulls the entries correctly.
Symptom |
Unexpected results |
Environment |
All OS environments |
Trigger |
Accessing the AFM object fileset over NFS mount with nfs-ganesha |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49771 |
High Importance
|
AFM outband metadata prefetch hangs if an orphan file already exists for the entries in the list file. AFM orphan files have an inode allocated but not initialized.
Symptom |
Deadlock |
Environment |
All OS environments |
Trigger |
AFM outband metadata prefetch with orphan files |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49772 |
High Importance
|
Daemon assert going off: otherP == NULL in clicmd.C, resulting in a daemon restart.
Symptom |
Abend/Crash |
Environment |
All platforms |
Trigger |
Random occurrence of the condition due to collision of randomly generated numbers |
Workaround |
None |
|
5.1.9.2 |
All |
IJ49792 |
High Importance
|
Add a config option to add nconnect for the NFS mount
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Create a fileset and set an NFS relationship between the target and the fileset |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49793 |
Suggested |
Prefetch is not generating the afmPrepopEnd callback event.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Run Prefetch on any fileset |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49794 |
High Importance
|
Prefix downloads fail, or read or ls fails, if the prefix option is used with download or fileset creation.
Symptom |
Error output/message |
Environment |
Linux Only |
Trigger |
Create fileset with --prefix option and do ls on fileset path or create a fileset and run download with --prefix option |
Workaround |
Run the below command on the fileset: mmchfileset <fs_name> <fset_name> -p afmobjectpreferdir=yes |
|
5.1.9.2 |
AFM |
IJ49795 |
Suggested |
Rename is not reflected to COS automatically if afmMUAutoRemove is configured.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux Only |
Trigger |
Configure the fileset with afmMUAutoRemove, then rename a file |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49796 |
High Importance
|
AFM COS to GCS hangs the file system on GCS errors if the credentials do not have sufficient permissions.
Symptom |
Stuck IO |
Environment |
Linux Only |
Trigger |
With credentials lacking read permission, do ls on the fileset path |
Workaround |
None |
|
5.1.9.2 |
AFM |
IJ49851 |
High Importance
|
A crash is observed in read_pages when called from page_cache_ra_unbound on SLES with kernel version >= 5.14.
Symptom |
The node crashes on a SLES 15 machine. It is specific to kernel versions >= 5.14 |
Environment |
Linux SLES, kernel version >=5.14 |
Trigger |
A single large file being read sequentially from one node (causing readahead to be performed on the file, or readahead forced with a posix_fadvise call, as sketched after this entry) while also being truncated/deleted from another node at the same time. |
Workaround |
None |
|
5.1.9.2 |
Regular file read flow in kernel version >= 5.14 |
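For illustration, a minimal C sketch of the reader side of this trigger, assuming a hypothetical file path: readahead is forced with posix_fadvise(2) and the file is read sequentially, while the concurrent truncate/delete must come from another node.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[1 << 20];
    int fd = open("/gpfs/fs1/bigfile", O_RDONLY);  /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* Force readahead across the whole file (length 0 means to EOF). */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));

    /* Sequential read; the truncate/delete happens on another node. */
    while (read(fd, buf, sizeof buf) > 0)
        ;
    close(fd);
    return 0;
}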
IJ49852 |
High Importance
|
With showNonZeroBlockCountForNonEmptyFiles set, the block count of an evicted file is always reported as one, i.e., a fake block count.
This is a workaround for faulty applications (e.g., GNU tar --sparse) that erroneously assume zero st_blocks means the file contains no nonzero bytes (see the sketch after this entry).
Symptom |
Unexpected results |
Environment |
All OS environments |
Trigger |
Block count display issue on evicted files |
Workaround |
None |
|
5.1.9.2 |
AFM |
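For context, a minimal C sketch of the faulty application pattern this option works around, assuming a hypothetical path to an evicted file: the program treats st_blocks == 0 as proof that a file holds only zeros, which mislabels evicted files whose st_size is nonzero.

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;
    if (stat("/gpfs/fs1/evicted-file", &st) != 0) {  /* hypothetical path */
        perror("stat");
        return 1;
    }
    /* The faulty assumption: zero st_blocks implies nothing but zeros. */
    if (st.st_size > 0 && st.st_blocks == 0)
        printf("treated as fully sparse: file data would be skipped\n");
    else
        printf("%lld block(s) allocated\n", (long long)st.st_blocks);
    return 0;
}

With the option set, such a file reports one block, so tools of this kind read the data instead of skipping it.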
IJ49142 |
Suggested |
When running a workload on Windows that creates and deletes many files and directories in a short span, the inode numbers assigned to GPFS objects may be reused. If a stale inode entry persists in the GPFS cache due to in-flight hold counts, the conflict between the old and new object types can cause this stale entry to produce a file-or-directory-not-found error.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows/x86_64 only |
Trigger |
Running a workload on Windows that continuously and rapidly creates and deletes many files and directories |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49144 |
High Importance
|
When a dependent fileset is created inline using afmOnlineDepFset, or created offline as in the earlier supported method, enabling mmafmconfig is mandated so that .afm/.afmtrash is present inside the dependent fileset at the DR site, to handle the conflict renames that AFM performs.
Running mmafmconfig enable on the dependent fileset at the DR site also creates the .afmctl file, which has the CTL attribute enabled and cannot be removed except through mmafmlocal. This causes the restore to fail when removing the .afmctl inside the dependent fileset while restoring to a snapshot that does not contain the dependent fileset.
The fix is to have mmafmconfig enable create .afm/.afmtrash without creating the .afmctl file, which is not needed inside dependent filesets anyway.
Symptom |
Unexpected Behavior |
Environment |
Linux Only (at the DR site) |
Trigger |
Running failoverToSecondary with --restore option with dependent filesets inside the independent DR fileset. |
Workaround |
mmafmconfig disable on all the dependent filesets at the DR site before running failoverToSecondary with --restore option |
|
5.1.9.1 |
AFM |
IJ49145 |
High Importance
|
When failover is performed to an entirely new secondary fileset at the DR site, within the same file system as the previous target secondary fileset, the path under which the dependent fileset is requested to link should change too.
For this, the existing dependent fileset is unlinked; when it is then linked under the new path, E_EXIST is returned because the dependent fileset already exists, and the primary's follow-up lookup of remote attributes fails the queue. The fix is to return E_EXIST only if the fileset exists in the linked state, so that the follow-up operation from the primary to build the remote attributes succeeds.
Symptom |
Unexpected Behavior |
Environment |
All Linux OS environments (at the DR site) |
Trigger |
Performing changeSecondary to the same Secondary site, but to a different fileset in it, with a dependent fileset in it. |
Workaround |
Manually create/link the dependent fileset at the new secondary site/fileset |
|
5.1.9.1 |
AFM |
IJ49151 |
High Importance
|
Memory corruption can happen if an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint is running on a file system whose NSD servers have a different endianness from the client node on which the application is running.
Symptom |
Segmentation fault, assert, or kernel crash |
Environment |
Linux only |
Trigger |
Run an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint on nodes with mixed endianness. |
Workaround |
Don't run an application using the GPFS_FINE_GRAIN_WRITE_SHARING hint on nodes with mixed endianness. |
|
5.1.9.1 |
Data shipping |
IJ49152 |
High Importance
|
Running mmexpelnode to expel the node on which the command is running may hit an assert.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
Expel the node on which mmexpelnode is running |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49044 |
High Importance
|
When a file is opened with the O_APPEND flag, sequential small-read performance is poor (see the sketch after this entry)
Symptom |
Performance Impact/Degradation |
Environment |
ALL Operating System environments |
Trigger |
A file is opened with the O_APPEND flag |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
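A minimal C sketch of the access pattern behind the degradation, assuming a hypothetical log-file path: the descriptor carries O_APPEND (opened read-write, as an application that appends to a log and reads it back might do) and the reads are small and sequential.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];                 /* small, sequential reads */
    int fd = open("/gpfs/fs1/app.log", O_RDWR | O_APPEND);  /* hypothetical */
    if (fd < 0) { perror("open"); return 1; }

    while (read(fd, buf, sizeof buf) > 0)
        ;                           /* slow before this fix */
    close(fd);
    return 0;
}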
IJ49154 |
Critical |
The GPFS daemon could fail unexpectedly with an assert when handling disk address changes.
This can happen when the number of blocks in a file becomes very large and causes a variable used in an internal calculation to overflow.
This is more likely to happen on file systems where the block size is very small.
Symptom |
Abend/Crash |
Environment |
ALL Operating System environments |
Trigger |
The number of blocks in a user or system file increases past a certain point. The actual number of blocks differs depending on block size, replication factor, etc. |
Workaround |
None |
|
5.1.9.1 |
All Scale Users |
IJ49169 |
High Importance
|
AFM metadata prefetch does not preserve ctime on files that are migrated at home. This causes a ctime mismatch between cache and home.
Symptom |
Unexpected results |
Environment |
All Linux OS environments |
Trigger |
AFM metadata prefetch when files are migrated at home. |
Workaround |
None |
|
5.1.9.1 |
AFM |
IJ49196 |
High Importance
|
If a COS bucket has a file object and a directory object with the same name, the file objects were downloaded by default, even when the customer's requirement was to download the directory content instead of the files.
Symptom |
Unexpected Behavior |
Environment |
Linux Only |
Trigger |
A COS bucket has a file object and a directory object with the same name. |
Workaround |
None |
|
5.1.9.1 |
AFM |
IJ49197 |
Suggested |
Exception in mmsysmonitor.log because some files were removed during mmcallhome data collection
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Changing some files (e.g., CCR files) during mmcallhome GatherSend data collection |
Workaround |
None |
|
5.1.9.1 |
Callhome |
IJ49198 |
Suggested |
mmcallhome SendFile: progress percentage not updated
Symptom |
Performance Impact/Degradation |
Environment |
Linux Only |
Trigger |
Only start and done (100%) were shown when using mmcallhome sendfile to upload |
Workaround |
None |
|
5.1.9.1 |
Callhome |
IJ49216 |
High Importance
|
A quota manager/client node may assert during a per-fileset quota check when a being-deleted inode is present.
Symptom |
Cluster/File System Outage |
Environment |
ALL |
Trigger |
An invalid filesetId from a being-deleted inode is not handled correctly by the per-fileset quota check logic. |
Workaround |
None |
|
5.1.9.1 |
Quota |
IJ49135 |
Critical |
The assert "logAssertFailed: oldDA1Found[i].compAddr(synched1[I])" goes off, resulting in an mmfsd daemon crash; ultimately, the file system may become unmountable on any node.
Symptom |
Abend/Crash |
Environment |
All Operating Systems |
Trigger |
Run fsck to fix the duplicated disk address on compressed files. |
Workaround |
None |
|
5.1.9.0 |
Compression |
IJ48873 |
Critical |
File data loss when copying or archiving data from migrated files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface; see the sketch after this entry).
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Switch to other copy or archive tools to copy or archive the data from migrated files, or recall the files before using the copy or archive applications. |
|
5.1.9.0 |
DMAPI |
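For context, a hedged C sketch of the lseek(2) hole-detection loop that sparse-aware copy tools use; it is not the actual cp/tar implementation, the paths are hypothetical, and error handling is trimmed. Before this fix, a migrated file could present its entire body as a hole to such a loop, so no data would be copied.

#define _GNU_SOURCE                  /* SEEK_DATA / SEEK_HOLE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[1 << 16];
    int in  = open("/gpfs/fs1/migrated-file", O_RDONLY);
    int out = open("/tmp/copy", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    off_t end  = lseek(in, 0, SEEK_END);
    off_t data = lseek(in, 0, SEEK_DATA);       /* -1 (ENXIO): all hole */

    while (data >= 0 && data < end) {
        off_t hole = lseek(in, data, SEEK_HOLE);    /* end of data region */
        for (off_t pos = data; pos < hole; ) {
            size_t want = (size_t)(hole - pos);
            if (want > sizeof buf) want = sizeof buf;
            ssize_t n = pread(in, buf, want, pos);
            if (n <= 0) break;
            pwrite(out, buf, (size_t)n, pos);   /* holes stay unwritten */
            pos += n;
        }
        data = lseek(in, hole, SEEK_DATA);          /* next data region */
    }
    ftruncate(out, end);            /* preserve the logical file size */
    close(in); close(out);
    return 0;
}

A migrated file wrongly reporting no data region at all makes this loop exit immediately, producing an empty sparse copy.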
IJ48871 |
Critical |
File data loss when copying or archiving data from snapshot and clone files (e.g., using a "cp" or "tar" command that detects sparse holes in source files with the lseek(2) interface).
Symptom |
Data Loss |
Environment |
Linux Only |
Trigger |
Using copy or archive tools that detect sparse holes in the source file with the lseek(2) interface. |
Workaround |
Switch to other copy or archive tools to copy or archive the data from snapshot and clone files. |
|
5.1.9.0 |
Snapshot and clone files |