IJ38148 |
High Importance |
When a parent directory has the SGID bit set, a file created with the SGID bit specified by a user who does not belong to the directory's group can still have the SGID bit set.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger | Create a file with the SGID bit specified as a non-member group user in a directory with the SGID bit set. |
Workaround |
Remove the SGID bit from the directory. |
|
5.1.3.1 |
Core GPFS |
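The condition in IJ38148 can be observed from user space by creating a file with the SGID bit requested and then inspecting its mode. The following is a minimal sketch using only Python's standard `os` and `stat` modules; the helper names `has_sgid` and `create_with_sgid` are hypothetical and are not part of IBM Spectrum Scale.

```python
import os
import stat


def has_sgid(mode: int) -> bool:
    """Return True if the SGID bit is set in an st_mode value."""
    return bool(mode & stat.S_ISGID)


def create_with_sgid(path: str) -> bool:
    """Create a new file requesting mode 2644 and report whether SGID is set.

    Intended to be run as a user who is not a member of the parent
    directory's group, inside a directory that has the SGID bit set.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o2644)
    try:
        return has_sgid(os.fstat(fd).st_mode)
    finally:
        os.close(fd)
```

On an affected level, `create_with_sgid` would be expected to return `True` for a non-member user, which is the unexpected result the APAR describes.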
IJ38283 |
High Importance |
IBM Spectrum Scale Erasure Code Edition interacts with third-party software/hardware APIs for internal disk enclosure management. If the management interface becomes degraded and starts to hang commands in the kernel, the hang may also block communication handling threads. This causes a node to fail to renew its lease, causing it to be fenced off from the rest of the cluster. This may lead to additional outages.
Symptom |
Hang/Deadlock/Unresponsiveness/Long Waiters |
Environment |
Linux |
Trigger | Degradation in back-end storage management that causes commands to hang in the kernel. |
Workaround |
A node with hardware problems shows waiters 'Until NSPD Server discovery completes.' Reboot any node on which these GPFS waiters exceed 2 minutes. |
|
5.1.3.1 |
ESS, GNR |
IJ38284 |
Suggested |
On a system under heavy workload, when some quota entries are deleted explicitly (for example, by fileset deletion), the quota manager might deadlock.
Symptom |
Deadlock |
Environment |
All |
Trigger |
Heavy workloads and quota entries being deleted. |
Workaround |
None |
|
5.1.3.1 |
Core GPFS |
IJ38285 |
High Importance |
When running I/O through KNFS with file audit logging enabled, an invalid pointer might be accessed.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger | Certain patterns of KNFS IO with file audit logging enabled. |
Workaround |
None |
|
5.1.3.1 |
File audit logging |
IJ38286 |
Suggested |
If the first entry in a list file is a directory, all operations are terminated because startmarker fails to set up.
Symptom |
Command failed with invalid entries. |
Environment |
Linux |
Trigger |
First entry of a list file is a directory in the --list-file option. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ38287 |
Critical |
Kernel crash with kernel stack that shows the pemsIpmi functions. The RIP of the kernel crash shows RIP: 0010:kmem_cache_alloc_trace+0x7f/0x1c0.
Symptom |
Kernel crash |
Environment |
Linux (x86_64) |
Trigger |
No special trigger. |
Workaround |
None |
|
5.1.3.1 |
ESS, GNR |
IJ38302 |
Suggested |
While replicating a file from a mapped directory path to COS, AFM finds that the parent directory is the fileset root, overwrites the bucket assigned to this mapped directory, and replicates the file to the old bucket.
Symptom |
Files are not synced to the mapped directory. |
Environment |
Linux |
Trigger |
Files are not replicating to the mapped directory. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ38307 |
Suggested |
mmafmcosaccess does not check whether the given path belongs to the same fileset. It also needs to check file system and fileset consistency for the given command.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
The path given to the mmafmcosaccess command is a valid path but does not belong to the same fileset. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ38308 |
High Importance |
When afmSyncNFSv4ACL is set, the ACL buffer size is not verified during the cache refresh. This causes the kernel to crash if the returned buffer length is zero.
Symptom |
Crash |
Environment |
Linux |
Trigger | AFM caching with afmSyncNFSv4ACL option |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ38310 |
High Importance |
Due to a change in procps output in Cygwin version 3.3, IBM Spectrum Scale fails to start.
Symptom |
Unexpected Results/Behavior |
Environment |
Windows (x86_64) |
Trigger | IBM Spectrum Scale startup |
Workaround |
Downgrade Cygwin |
|
5.1.3.1 |
Admin Commands |
IJ38311 |
Suggested |
The 32-bit GPFS API library is not available in the default path on Ubuntu.
Symptom |
Error output/message |
Environment |
Linux (x86_64) |
Trigger |
Build an application with the 32-bit GPFS API library on Ubuntu. |
Workaround |
Modify the build process of the application to search for the 32-bit GPFS API library in a different directory. |
|
5.1.3.1 |
GPFS API |
IJ38312 |
Suggested |
mmperfmon delete --expiredkeys fails with a timeout or exception.
Symptom |
Error output/message |
Environment |
Linux |
Trigger |
Remote mounted filesystem with a slow or overloaded remote system |
Workaround |
None |
|
5.1.3.1 |
Performance monitoring |
IJ38313 |
High Importance |
Daemon assert going off: endBufOffset >= 0 && endBufOffset < codeP->getBufMaxPayload(endBuf)
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger | A media error is discovered and fixed on an IBM ESS 3200 system that is using Flash Core Module NVMe drives on a specific virtual track boundary. Not all media errors cause this crash. |
Workaround |
None |
|
5.1.3.1 |
ESS, GNR |
IJ38326 |
Suggested |
Kernel crash when required mount options are missing.
Symptom |
Abend/Crash |
Environment |
Linux |
Trigger |
Issue a mount request where the dev= option is missing. Either remove that from /etc/fstab, or issue a mount command that does not read options from /etc/fstab, e.g.: mount -t gpfs /gpfs/fs1 /gpfs/fs1 |
Workaround |
Always have the required dev= mount option available. This is the default in /etc/fstab. |
|
5.1.3.1 |
Core GPFS |
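The workaround for IJ38326 depends on every GPFS entry in /etc/fstab carrying a dev= option. The following is a minimal sketch of a check for that, assuming the standard whitespace-separated fstab layout (device, mount point, type, options). The function name `gpfs_entries_missing_dev` and the sample entries in the usage note are hypothetical.

```python
from typing import List


def gpfs_entries_missing_dev(fstab_text: str) -> List[str]:
    """Return mount points of gpfs-type fstab entries that lack a dev= option."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Only consider complete entries whose file system type is gpfs.
        if len(fields) < 4 or fields[2] != "gpfs":
            continue
        options = fields[3].split(",")
        if not any(opt.startswith("dev=") for opt in options):
            missing.append(fields[1])
    return missing
```

For example, passing the contents of /etc/fstab (read with `open("/etc/fstab").read()`) returns an empty list when every GPFS entry has the option. An illustrative entry such as `fs2 /gpfs/fs2 gpfs rw,noauto 0 0` would be reported as `/gpfs/fs2`.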
IJ38327 |
Suggested |
mmvdisk recovery group conversion may conflict with settings for nsdRAIDSmallBufferSize from the previous deployment scripts. mmvdisk applies a value of -1 to this setting, which conflicts with the original value of 256KiB. As a result, the daemon prints a message on startup warning the user that nsdRAIDSmallBufferSize has been reduced to 4KiB. This may impact performance.
Symptom |
Error output/message, Performance Impact/Degradation |
Environment |
Linux |
Trigger |
mmvdisk recovery group conversion from the pre-2020 server config settings. |
Workaround |
Delete the old nsdRAIDSmallBufferSize setting of 256K in SDRFS, or delete any -1 values that were part of the mmvdisk rg conversion override. |
|
5.1.3.1 |
ESS, GNR |
IJ38328 |
High Importance |
When mmcheckquota commands run concurrently, the second command, which should fail with "Operation already in progress", does not terminate until interrupted.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger | Running more than one mmcheckquota command at the same time. |
Workaround |
Terminate the second mmcheckquota command with ctrl-C. |
|
5.1.3.1 |
Quotas |
IJ38330 |
Suggested |
The '-' char is incorrectly used for a range between two values.
Symptom |
No error is reported. |
Environment |
Linux |
Trigger |
An invalid character such as ';' is also accepted. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ38332 |
High Importance |
The SUID and SGID bits are not cleared after a successful write/truncate to a file by a non-owner.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger | Create a file with the SUID and SGID bits set. As a non-root user or a non-group member user, write to the file with the write() system call or truncate the file with the truncate() system call. |
Workaround |
Ensure that only owners can write to an executable binary file that has the SUID/SGID bit set. |
|
5.1.3.1 |
Core GPFS |
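Whether a file is affected by IJ38332 can be judged by comparing its mode before and after the non-owner write or truncate. The following is a minimal sketch using Python's standard `stat` module; the helper name `setid_bits_survived` is hypothetical.

```python
import stat

# Mask covering both the set-user-ID and set-group-ID bits.
SETID_MASK = stat.S_ISUID | stat.S_ISGID


def setid_bits_survived(mode_before: int, mode_after: int) -> bool:
    """Return True if any SUID/SGID bit set before the write is still set after."""
    return bool(mode_before & mode_after & SETID_MASK)
```

The two arguments would typically come from `os.stat(path).st_mode`, taken once before and once after the write by the non-owner. A `True` result corresponds to the behavior described in this APAR.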
IJ38931 |
Suggested |
Certain file names that contained control characters were not properly escaped when logged in JSON format by file audit logging or watch folder.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Creating a file with control characters in the name |
Workaround |
None |
|
5.1.3.1 |
File audit logging, Watch folder |
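For reference, correct JSON output escapes control characters inside strings. The following is a minimal sketch using Python's standard `json` module; the record layout with a single `path` field and the function name `audit_record` are assumptions for illustration and do not reflect the actual file audit logging or watch folder schema.

```python
import json


def audit_record(path: str) -> str:
    """Serialize a hypothetical one-field record; json.dumps escapes control characters."""
    return json.dumps({"path": path})
```

A consumer that round-trips the record with `json.loads` recovers the original file name, including any newline or other control character, which is a simple way to verify that an emitted event is well formed.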
IJ38956 |
Suggested |
When running mmhealth config monitor pause, followed by a mmhealth config monitor resume, the THRESHOLD component will stay in DISABLED state.
Symptom |
Error output/message |
Environment |
All |
Trigger |
The issue occurs only if the node health monitoring was paused and resumed again. |
Workaround |
Execute the command "mmsysmonc enable thresholds" |
|
5.1.3.1 |
System health |
IJ38958 |
Critical |
On Linux, a kernel crash may occur when open() is called with the O_CREAT flag on a file that is already open.
Symptom |
Kernel crash |
Environment |
Linux |
Trigger |
Using open() with O_CREAT flag on system with Linux kernel 3.10 or higher. |
Workaround |
Avoid using open() with O_CREAT flag |
|
5.1.3.1 |
Core GPFS |
IJ38849 |
HIPER |
Files are not fully cached on AFM COS filesets
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
File read on AFM uncached files. |
Workaround |
Use AFM prefetch to cache the files again. |
|
5.1.3.1 |
AFM |
IJ39003 |
Critical |
NFS status shown as 'unknown'. This may interfere with NFS failover capabilities.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
None |
Workaround |
None |
|
5.1.3.1 |
NFS |
IJ39004 |
Suggested |
Issuing io_uring IORING_OP_READ_FIXED requests to read data into preallocated buffers fails with an error.
Symptom |
IO error |
Environment |
Linux |
Trigger |
No pre-conditions are necessary. |
Workaround |
When using io_uring, use IORING_OP_READ instead of IORING_OP_READ_FIXED. This would require changing the application issuing the requests and might come at a performance penalty. |
|
5.1.3.1 |
Core GPFS |
IJ39011 |
High Importance |
The online replica compare function could incorrectly flag a mismatch on the last block of a file when the block was preallocated as a full block and later reduced to a fragment.
Symptom |
Unexpected Results/Behavior |
Environment |
All |
Trigger | Run online replica compare on files with preallocated blocks. |
Workaround |
Avoid running online replica compare. |
|
5.1.3.1 |
Core GPFS |
IJ39014 |
High Importance |
A GPU Direct Storage application may encounter an ENONET error when calling cuFileRead().
Symptom |
IO error |
Environment |
Linux |
Trigger | When a failed NSD server is rejoining the cluster a fail-back to this NSD server can be triggered. GPU Direct Storage I/O requests can fail during this transition phase. |
Workaround |
None |
|
5.1.3.1 |
RDMA |
IJ39054 |
Suggested |
The command 'gpfs.snap --component-only hadoop' doesn't gather hadoop data. Instead it collects CES data if CES is enabled on the system.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Running 'gpfs.snap --component-only hadoop'. |
Workaround |
Manually run 'hadoop.snap.py' instead of 'gpfs.snap --component-only hadoop'. |
|
5.1.3.1 |
HDFS Connector |
IJ39086 |
Critical |
A DMAPI read event is generated on AFM deferred deletion files, causing unnecessary recalls, if only an AFM recovery snapshot exists.
Symptom |
Unexpected results |
Environment |
Linux |
Trigger |
Deletion of migrated files on AFM fileset |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ39089 |
Suggested |
Ganesha crashed with the following stack:
#5 0x00007f65b55a5e4e state_wipe_file (libganesha_nfsd.so.3.5)
#6 0x00007f65b567787c _mdcache_lru_unref (libganesha_nfsd.so.3.5)
#7 0x00007f65b56568e2 mdcache_put (libganesha_nfsd.so.3.5)
#8 0x00007f65b565adea mdcache_put_ref (libganesha_nfsd.so.3.5)
#9 0x00007f65b5619d73 open4_create_fh (libganesha_nfsd.so.3.5)
#10 0x00007f65b561c451 open4_ex (libganesha_nfsd.so.3.5)
#11 0x00007f65b561d6c0 nfs4_op_open (libganesha_nfsd.so.3.5)
#12 0x00007f65b5604cca process_one_op (libganesha_nfsd.so.3.5)
#13 0x00007f65b5605d46 nfs4_Compound (libganesha_nfsd.so.3.5)
#14 0x00007f65b555b99c nfs_rpc_process_request (libganesha_nfsd.so.3.5)
Symptom |
Ganesha Crash |
Environment |
All |
Trigger |
The problem may occur if there are a lot of small files with the same file name created/deleted from NFS clients at the same time. |
Workaround |
None |
|
5.1.3.1 |
cNFS, CES NFS |
IJ39091 |
Suggested |
Ganesha logs the following messages:
2022-03-11 14:28:22 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_37] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499
2022-03-11 14:28:32 : epoch 0009016d : protocol2b : gpfs.ganesha.nfsd-14806[svc_48] GPFSFSAL_lookup : FSAL :CRIT :DOTDOT error, inode: 4308074499
Symptom |
DOTDOT error message in ganesha.log |
Environment |
All |
Trigger |
The problem may occur if a .snapshot directory exists and it has the same inode number as its parent directory. |
Workaround |
None |
|
5.1.3.1 |
cNFS, CES NFS |
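The trigger condition for IJ39091 (a .snapshot directory whose inode number equals that of its parent) can be checked with a stat comparison. The following is a minimal sketch using Python's standard `os` module; the function name `shares_inode_with_parent` is hypothetical, and the default child name `.snapshot` is taken from the trigger text above.

```python
import os


def shares_inode_with_parent(directory: str, child: str = ".snapshot") -> bool:
    """Return True if directory/child exists and reports the same inode number as directory."""
    child_path = os.path.join(directory, child)
    try:
        return os.stat(child_path).st_ino == os.stat(directory).st_ino
    except FileNotFoundError:
        # Either the directory or the child entry does not exist.
        return False
```

Calling the function on an exported directory and getting `True` suggests that lookups in that directory could produce the DOTDOT message shown in this APAR.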
IJ39093 |
High Importance |
An error 22 is hit when trying to get the valid data blocks on a file in resync.
Symptom |
Unexpected Behavior |
Environment |
Linux (AFM gateway node) |
Trigger | Running resync with uncached (possibly evicted) files at the SW cache site. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ39250 |
Suggested |
After the refresh interval, the cache bit is reset when getobjmetats is triggered on a cached file and finds ETag mismatches.
Symptom |
Files get evicted |
Environment |
Linux |
Trigger |
Files get evicted because cache bit gets reset. |
Workaround |
None |
|
5.1.3.1 |
AFM COS |
IJ39251 |
Suggested |
The NFS mount point is not killed if the home fileset is unresponsive or hung. This causes multiple NFS mounts to be created for the same fileset.
Symptom |
Too much memory consumption on the NFS mount point. |
Environment |
Linux |
Trigger |
The gateway node consumes more memory on the NFS mount because multiple mount points exist for the fileset. |
Workaround |
None |
|
5.1.3.1 |
AFM DR |
IJ39252 |
Suggested |
Watch folder events could show an old path to a file if a directory in its path had recently been renamed.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Rename directories being watched |
Workaround |
None |
|
5.1.3.1 |
Watch folder |
IJ39203 |
Suggested |
mmafmcoskeys failed to set access and secret keys.
Symptom |
Access and secret keys fail to set. |
Environment |
Linux |
Trigger |
Trying to set access and secret keys. |
Workaround |
None |
|
5.1.3.1 |
AFM COS |
IJ39265 |
Suggested |
Disk quota error is not reported when a readdir is happening at fileset root.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
readdir on a fileset |
Workaround |
None |
|
5.1.3.1 |
AFM COS |
IJ39369 |
Suggested |
If the maxFilesToCache configuration value is large (for example, 100,000,000), iterating the whole gnode hash table may take more than 22 seconds, which causes the Linux kernel warning message "soft lockup - CPU stuck for 22s!"
Symptom |
Warning message |
Environment |
Linux |
Trigger |
Set the maxFilesToCache configuration value to a large value, for example, 100,000,000. |
Workaround |
Lower the maxFilesToCache configuration value. |
|
5.1.3.1 |
Core GPFS |
IJ39370 |
Critical |
Stack corruption due to possible buffer overflow.
Symptom |
mmfsd restart |
Environment |
Linux |
Trigger |
mmfsd restart at AFM gateway node. |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ39372 |
Critical |
The AFM flush thread deadlocks if the remote is not responding, because the afmSyncOpWaitTimeout and afmAsyncOpWaitTimeout configuration options are not honored and the stuck threads are not killed.
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
AFM caching with unresponsive home |
Workaround |
None |
|
5.1.3.1 |
AFM |
IJ39373 |
Critical |
AFM fails to upload the object if the name starts with a '-' character.
Symptom |
Deadlock |
Environment |
Linux |
Trigger |
AFM+COS caching with special file names |
Workaround |
None |
|
5.1.3.1 |
AFM |
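To gauge exposure to IJ39373, it can help to list the names in a fileset that start with a '-' character. The following is a minimal sketch using Python's standard `os.walk`; the function name `names_with_leading_dash` is hypothetical, and the sketch only finds such names without changing how AFM uploads them.

```python
import os
from typing import Iterator


def names_with_leading_dash(root: str) -> Iterator[str]:
    """Yield paths under root whose final component starts with '-'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name.startswith("-"):
                yield os.path.join(dirpath, name)
```

For example, `list(names_with_leading_dash("/gpfs/fs1/fileset1"))` returns the candidate paths, where the fileset path shown is illustrative only.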
IJ39374 |
High Importance |
In huge clusters (with a lot of performance monitoring data) and on systems with high load on the pmcollector / GUI node, perfmon queries might run into a 5s timeout. This could lead to missing data in the GUI.
Symptom |
Component Level Outage |
Environment |
Linux |
Trigger | Huge clusters (with a lot of performance monitoring data) and systems with high load on the pmcollector / GUI node. |
Workaround |
None |
|
5.1.3.1 |
Performance monitoring, GUI |
IJ39394 |
High Importance |
GPFS recovery is blocked after cables are pulled and put back, due to an RPC being sent while taking GPFS dumps.
Symptom |
Hang |
Environment |
Linux (ESS) |
Trigger | Pull cables and then put the cables back. |
Workaround |
None |
|
5.1.3.1 |
Core GPFS |
IJ39397 |
Suggested |
The IBM Spectrum Scale admin commands and handling of file system encryption keys require the use of more robust settings.
Symptom |
None |
Environment |
All |
Trigger |
None |
Workaround |
None |
|
5.1.3.1 |
Admin Commands |
IJ37542 |
Critical |
On Linux kernel 3.10 or later, if the O_TRUNC flag is used and the file has been opened already, the O_TRUNC flag might be incorrectly ignored.
Symptom |
Unexpected Results/Behavior |
Environment |
Linux |
Trigger |
Using open() with O_CREAT and O_TRUNC flags on a system with Linux kernel 3.10 or later. |
Workaround |
Avoid using open() with O_CREAT and O_TRUNC flags. |
|
5.1.3.0 |
Core GPFS |
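The expected behavior for IJ37542 is that opening an already-open file with O_TRUNC reduces its size to zero. The following is a minimal sketch of a check for this using Python's standard `os` module; the function name `truncates_while_open` is hypothetical.

```python
import os


def truncates_while_open(path: str) -> bool:
    """Write data, keep the file open, reopen with O_CREAT|O_TRUNC, and check the size is 0."""
    first = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.write(first, b"payload")
        # Second open while the first descriptor is still held.
        second = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
        try:
            return os.fstat(second).st_size == 0
        finally:
            os.close(second)
    finally:
        os.close(first)
```

A `False` result on a GPFS path would indicate that O_TRUNC was ignored, as described in this APAR. Note that the check overwrites the contents of `path`, so it should only be pointed at a scratch file.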