IBM Spectrum Scale APARs Resolved in 5.1.0.x

IJ30552

Migration within the same group pool does not fully rebalance unless higher preferred pools have more available space than less preferred pools. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Running migration policy within the same group pool.
Workaround	None

5.1.0.3

Core GPFS

IJ30553

The administrator is unable to change the page pool setting on the GNR recovery group server. The problem is seen only on recovery groups not managed by mmvdisk. The mmchconfig command will fail, and the following error message is displayed:

The --force-rg-server flag must be used to change the pagepool

(show details)

Symptom	Error output/message Unexpected Results/Behavior
Environment	Linux
Trigger	mmchconfig pagepool=newValue on a GNR server not administered by the mmvdisk command set.
Workaround	None

5.1.0.3

GNR, ESS

IJ30641

logAssertFailed (*respPP != __null) cacheops.C (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Writing to the file in AFM read-only mode to a fileset which is exported using a ganesha NFS server.
Workaround	Export the AFM read-only fileset path in read-only mode from the ganesha NFS server.

5.1.0.3

AFM

IJ30682

After dm_punch_hole call, a dm_get_allocinfo could return improper results for the information of the data blocks allocation. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Calling dm_punch_hole to create hole in a file.
Workaround	None

5.1.0.3

DMAPI

IJ30684

After Node-B successfully reestablishes a broken connection to Node-A, Node-A still shows the reconnect_start state (DEGRADED). (show details)

Symptom	Error output/messages
Environment	All
Trigger	Reconnecting broken connections
Workaround	Restart the system health monitor (mmsysmoncontrol restart).

5.1.0.3

System health

IJ30710

When mmbackup or tsapolicy is called to scan files, it could report "no such file or directory" for existing files. (show details)

Symptom	Unexpected Behavior
Environment	All
Trigger	Running mmbackup or tsapolicy operation while there are file deletions in progress.
Workaround	Rerun the mmbackup or the tsapolicy command again.

5.1.0.3

mmbackup, DMAPI, AFM, tsapolicy, GPFS API

IJ30606

An RPC message could be handled twice when TCP reconnect happens This could cause log assertion, FS struct error or be silently handled depending on the type of RPC. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Network is not good which leads to TCP connection reconnect.
Workaround	None

5.1.0.3

Core GPFS

IJ30681

Hit error 2 while replicating to COS. (show details)

Symptom	Replication halts with remote error 2.
Environment	Linux
Trigger	When remote attrs are set on directory in COS.
Workaround	None

5.1.0.3

AFM

IJ30685

SetAttr operation on renamed object can get requeued forever during AFM replication to the COS backend. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux (AFM gateway nodes)
Trigger	1) Create newObject1 2) SetXattr/SetAttr on newObject1 3) Create newObject2 4) SetXattr/SetAttr on newObject2 5) Rename newObject1 as newObject2
Workaround	None

5.1.0.3

AFM

IJ30712

The CES network status is shown as UNKNOWN on CES nodes if the policy is set to 'node-affinity' and the node is a member of a group. (show details)

Symptom	Error output/message
Environment	Linux (CES nodes)
Trigger	Policy is set to node-affinity and the node is member of a group.
Workaround	An alternative configuration could be to use 'even-coverage' as policy when the CES IP addresses and nodes are assigned to corresponding groups.

5.1.0.3

System health

IJ30777

If the mmvdisk command takes more than 60 seconds to complete, mmhealth reports all pdisks as vanished. On larger systems with many I/O nodes and pdisks, 60 second timeouts are not enough. (show details)

Symptom	Unnecessary events shown in mmhealth / GUI
Environment	Linux, ESS I/O nodes
Trigger	Running mmvdisk command on a system with many I/O nodes and pdisks.
Workaround	Manually increase the time out of the mmvdisk command execution in the mmhealth monitoring.

5.1.0.3

System health, GUI

IJ30793

mmcrfs fails to create file systems when the cluster is configured with minQuorumNodes greater than one and tiebreakerDisks are in use. (show details)

Symptom	Error output/message Unexpected Results/Behavior
Environment	All
Trigger	CCR Cluster minQuorumNodes parameter is set to more than one and tiebreakerDisks are in use.
Workaround	Either set minQuorumNodes back to the default or disable tiebreakerDisks.

5.1.0.3

Admin commands

IJ30757

Upload fails with error 235 when the max-upload-parts option is set to a value higher then what an int data type can hold. (show details)

Symptom	Upload fails with err 235.
Environment	Linux
Trigger	Upload is triggered for large files with write chunk size.
Workaround	Use a lower max-upload-parts value.

5.1.0.3

AFM

IJ30795

Running "mmces events list" stdout prints many trailing white characters (empty spaces) unnecessarily. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Running mmces events list.
Workaround	None

5.1.0.3

CES

IJ30786

Revalidation on a AFM fileset fails on a RHEL 8.3 gateway node and home changes may not be detected causing the data or metadata mismatch between cache and home. (show details)

Symptom	Unexpected Results
Environment	Red Hat Enterprise Linux 8.3
Trigger	AFM caching, resync and failover
Workaround	None

5.1.0.3

AFM, AFM DR

IJ30788

With async refresh enabled, file system quiesce is blocked during the remote operation and it might result in a deadlock if the remote is not responding. (show details)

Symptom	Deadlock
Environment	Linux
Trigger	AFM caching mode with async refresh
Workaround	None

5.1.0.3

AFM

IJ30776

Generating hard link remove operations fails during the recovery with error 22. (show details)

Symptom	Unexpected Results
Environment	Linux
Trigger	AFM recovery
Workaround	None

5.1.0.3

AFM

IJ30822

Daemon (AFM) assert goes off: getReqP->r_length <= ksP->r_bufSize (show details)

Symptom	Unexpected Results
Environment	Linux
Trigger	AFM caching mode and a read of an uncached file.
Workaround	Increase the afmReadBufferSize config value.

5.1.0.3

AFM

IJ30789

GPFS daemon assert going off: "verbsDtoThread_i: fatal device error" in file verbsSendRecv.C or assert "verbsAsyncThread_i: fatal device error" in file verbsInit.C, resulting in a GPFS hard shutdown. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	An RDMA adapter failure or an RDMA adapter reset
Workaround	None

5.1.0.3

RDMA

IJ30826

mmbackup repeats the backup/expire of the same file after file is changed. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Run incremental backup after file is changed
Workaround	None

5.1.0.3

mmbackup

IJ31043

In a mixed AIX/Linux cluster, the mmbackup command could fail with gskit/ssl errors after upgrading IBM Spectrum Protect code to 8.1.11, which introduced new rpm for gskit 8.0-55.17 that is not compatible with gpfs.gskit version.

(show details)

Symptom	Performance Impact/Degradation
Environment	AIX and Linux mixed cluster
Trigger	Run mmbackup in an AIX/Linux mixed cluster with IBM Spectrum Protect 8.1.11.
Workaround	None

5.1.0.3

mmbackup

IJ31044

AFM gateway node crashes if the home is not responding and multiple threads are trying to read the same file. (show details)

Symptom	Crash
Environment	Linux
Trigger	Reading the same file from multiple threads when the home is not responding.
Workaround	Use the undocumented config option "mmfsadm afm readbypass -1" on the gateway node.

5.1.0.3

AFM

IJ30985

AFM gateway node asserts if the home is not responding and multiple threads are trying to read the same file. (show details)

Symptom	Crash
Environment	Linux
Trigger	Reading the same file from multiple threads when the home is not responding.
Workaround	Use the undocumented config option "mmfsadm afm readbypass -1" on the gateway node.

5.1.0.3

AFM

IJ30987

When snapshot is being used, a modification to a file could result in the inode getting copied to the snapshot. If missing, a subsequent inode copy would trigger this assert. (show details)

Symptom	Daemon crash
Environment	All
Trigger	Using snapshot and update to the files in file system.
Workaround	Disable the assert.

5.1.0.3

Snapshots

IJ31060

State of Physical disk shown as "unknown" in mmhealth and GUI for ECE. (show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	ECE, this does not affect GNR/ESS
Workaround	None

5.1.0.3

System health, GUI

IJ31062

When aioSyncDelay config is enabled, the buffer steal and the aio writes that need to be done as buffered I/O may race with each other and causes log assert isSGPanicked in clearBuffer. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Enable aioSyncDelay config and do aio writes.
Workaround	Use mmchconfig aioSyncDelay=0 to disable aioSyncDelay config.

5.1.0.3

Core GPFS

IJ30994

When an RDMA network port for a cluster node dies, all the RDMA connections from all the RDMA adapters are disconnected. The following IBM Spectrum Scale log messages are indicative of this issue:

[W] VERBS RDMA async event IBV_EVENT_PORT_ERR on mlx5_2 port 1. [W] VERBS RDMA pkey index for pkey 0xffff changed from -1 to 0 for device mlx5_2 port 1. [W] VERBS RDMA port state changed from IBV_PORT_ACTIVE to IBV_PORT_DOWN for device mlx5_2 port 1 fabnum 0. [I] VERBS RDMA closed connection to 192.168.12.101 (ems5k) on mlx5_0 port 1 fabnum 0 index 0 cookie 1 due to IBV_EVENT_PORT_ERR [I] VERBS RDMA closed connection to 192.168.12.13 (ece-13 in esscluster.mmfsd.net) on mlx5_2 port 1 fabnum 0 index 11 cookie 12 due to IBV_EVENT_PORT_ERR

(show details)

Symptom	Performance Impact/Degradation
Environment	Linux
Trigger	An RDMA network port becomes unusable due to a network switch issue or other hardware issues.
Workaround	None

5.1.0.3

RDMA

IJ31071

GPFS command reports incorrect default for nsdRAIDMaxRecoveryRetries. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Command output
Workaround	None

5.1.0.3

Admin commands

IJ31021

Deleted inodes with inconsistencies were ignored in IBM Spectrum Scale versions prior to 5.1 but are flagged as a corruption in versions 5.1 and later. If such corruptions exist, then they can cause commands such as 'mmchdisk start' to fail. (show details)

Symptom	Operation failure due to file system corruption
Environment	All
Trigger	Corruption in deleted inodes
Workaround	Run offline fsck repair on 5.1 nodes only.

5.1.0.3

Core GPFS

IJ31111

mmbackup does not honor --max-backup-size in snapshot backup. (show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	Run snapshot backup with the --max-backup-size option
Workaround	None

5.1.0.3

mmbackup

IJ29859

When mmdelnode is issued against a node whose mmfsd daemon is still up, several of the nodes in the cluster can fail with messages like the following:

[E] Deleted node 169.28.113.36 (nodeX) is still up. [E] Node 169.28.113.36 (nodeX) has been deleted from the cluster

(show details)

Symptom	Abend/Crash
Environment	All
Trigger	Issuing mmdelnode against a node whose mmfsd daemon is still up on the target node, or against a node whose status cannot be determined.
Workaround	Ensure that the mmfsd is down on a given node before issuing the mmdelnode command against that node. If the status of the target node cannot be determined, ensure the node gets powered down.

5.1.0.2

Cluster membership

IJ29812

In IBM Spectrum Scale Erasure Code Edition, it is possible for all of the server's pdisks (physical disks) to become missing, either due to network failure, node failure, or through a planned "node suspend" maintenance procedure. When this happens, the system will continue to function if there is sufficient remaining fault tolerance. However, smaller configurations with less ECE nodes are exposed to a race condition where pdisk state changes can interrupt a system-wide descriptor update which causes the recovery group to resign. It is also possible to experience this problem with higher probability when using small ESS configurations, such as the GS1 or GS2 enclosures. For both ESS and ECE, a possible symptom may appear in the mmfs.log in this form when a pdisk state change is quickly followed by a resign message claiming VCD write failures before the system fault tolerance is exceeded: 2020-12-01_19:01:36.696-0400: [D] Pdisk n004p005 of RG rg1 state changed from ok/00000.180 to missing/ suspended/00050.180. 2020-12-01_19:01:36.697-0400: [E] Beginning to resign recovery group rg1 due to "VCD write failure", caller err 217 when "updating VCD: RGD"

Note that a "VCD write failure" with err 217 is a generic message issued when fault tolerance is exceeded during critical system updates, but in this case the race condition resigns the system when only a handful of missing disks are found.

(show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	pdisk state updates and system-wide descriptor updates on configurations with smaller recovery groups, such as a small number of nodes on ECE or a small Enclosure on ESS.
Workaround	Allow the system to resign and recover on its own. For smaller ECE configurations, avoid leaving a node suspended for long periods of time during maintenance tasks.

5.1.0.2

GNR, ESS

IJ29815

"Disk in use" error when using not partitioned DASD devices. DASD '/dev/dasdk' is in use. Unmount it first! mmcrnsd: Unexpected error from fdasd -a /dev/dasd. Return code: 1 mmcrnsd: [E] Unable to partition DASD device /dev/disk/by-path/ccw-0.0.0500 mmcrnsd: Failed while processing disk stanza on node node01.abc.de %nsd: device=/dev/disk/by-path/ccw-0.0.0500 nsd=scale_data01 servers=node01.abc.de usage=dataAndMetadata (show details)

Symptom	Upgrade/Install failure
Environment	Linux (s390x only)
Trigger	Running the mmcrnsd command on not partitioned DASD devices.
Workaround	Before running the mmcrnsd command, partition the DASD devices with fdasd -a, refresh the partition table on all server nodes, and specify the partitions of the devices in the mmcrnsd stanza file. Then, run the mmcrnsd command.

5.1.0.2

Installation toolkit

IJ29917

When a user starts the mmrestripefile command against a big file with -b option, it could take a long time(e.g. more than 20 minutes) to return but no data movement is seen between disks. This is because the big file is already balanced.

(show details)

Symptom	Performance impact
Environment	All
Trigger	Using mmrestripefile to rebalance big files
Workaround	None

5.1.0.2

mmrestripefile

IJ29918

Node crash if tremendous parallel access to file with NFS

(show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	File operation lease is broken.
Workaround	None

5.1.0.2

kNFS

IJ29829

Several RAS events had inconsistent values in their SEVERITY and STATE. For instance, the event "network_bond_degraded", which STATE=DEGRADED, has SEVERITY=INFO. As a result, related failures were not propagated properly.

(show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Conditions for raising of one of the following RAS events are met: - bond_degraded - ces_bond_degraded - ces_ips_all_unassigned - gnr_pdisk_unknown - heartbeat_missing - scale_mgmt_server_failed - tct_cs_disabled exports
Workaround	None. You need to look up the RAS event definition each time to understand if it is a problem.

5.1.0.2

System health, GUI

IJ29857

Starting in version 5.1, the afmIOFlags configuration on the fileset level is printed in Hex format and checking it in the integer range ends up printing an error to the logs. (show details)

Symptom	Error message
Environment	Linux
Trigger	Running a recovery and resync at the fileset level with the fileset level afmIOFlags configuration enabled.
Workaround	None

5.1.0.2

AFM

IJ29555

In a mirrored disk environment, I/O is expected to continue with the surviving disk if a disk experiences a problem. In case of a disk being created with a recovery group vdisk, and the recovery group is in a state that it continues to resign due to some vdisk's fault tolerance exceeded after a successful recovery, a race condition exists which could cause the logic of checking this state to be skipped. As a result of this, I/O will continue to be retried to the problem disk instead of moving on to the surviving disks.

(show details)

Symptom	I/O hang due to long waiters 'waiting for stateful NSD server error takeover (2)'
Environment	All
Trigger	A race condition in the logic that determines if a recovery group is repetitively experiencing a resign after a successful recovery due to a fault tolerance being reached.
Workaround	None

5.1.0.2

GNR

IJ29749

While migrating a file to the cloud, the GPFS daemon might hit a signal in StripeGroup::decNumAccessRights() (show details)

Symptom	Daemon hits a signal during file migration to the cloud.
Environment	Linux
Trigger	Migrating a file to the cloud
Workaround	None

5.1.0.2

TCT

IJ29883

The GPFS daemon could fail with logAssertFailed: fromNode != regP->owner. This could occur when a file system's disk configuration is changed just as a new file system manager is taking over. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Disk configuration change by commands such as mmadddisk, mmdeldisk, and mmchdisk.
Workaround	Avoid disk related commands right after a new file system has been appointed.

5.1.0.2

Core GPFS

IJ29938

skipRecall config does not work.

(show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Running skipRecall config
Workaround	None

5.1.0.2

DMAPI

IJ29884

AIO operations on encrypted files are handled as buffered IO, further decreasing the performance of the AIO operation in addition to the crytographic overhead introduced by the encryption of files in the file system.

(show details)

Symptom	Performance Impact/Degradation
Environment	AIX (POWER only)
Trigger	Using AIO
Workaround	None

5.1.0.2

Encryption

IJ29909

The user action of some health events for unconfigured performance monitoring sensors contain a wrong command.

(show details)

Symptom	The performance monitoring tool can not provide metric data for the component, according GUI panels might be empty.
Environment	Linux
Trigger	A component was enabled or configured (e.g. AFM or SMB) but the performance monitoring was not configured. Remount failure or unmounts on a busy system.
Workaround	None

5.1.0.2

System Health, AFM, QoS, NFS, SMB

IJ29942

In a kernel > 4.10 and file-sizes being a multiple of page sizes, a false error is returned once the read offset reaches file size.

(show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Having a file size that is a multiple of the page size
Workaround	Do not use a file size that is a multiple of the page size.

5.1.0.2

Core GPFS

IJ29943

When there are many threads doing sync writes through the same file descriptor, a contention on inodeCacheOjbMutex between them could impact the performance of writes.

(show details)

Symptom	Sync write performance becomes bad after upgrading to 5.0.x.
Environment	All
Trigger	Multiple threads are doing sync writes on the same opened file concurrently.
Workaround	Enable the config parameter "prefetchAggressisvenessWrite" to work around the problem if the workload is only doing sync writes.

5.1.0.2

Sync writes

IJ29960

AFM gateway nodes run out of memory during resync. glibc is known to use as many arenas as 8 times the number of CPU threads a systems has. This makes a multi-threaded program like AFM which allocates memory for queues to use a lot more memory than actually needed. (show details)

Symptom	Crash/Abend
Environment	Linux
Trigger	AFM resync under heavy workload
Workaround	None

5.1.0.2

AFM

IJ29989

The tsfindinode utility incorrectly reports file path as not found for valid inodes.

(show details)

Symptom	Incorrect output
Environment	All
Trigger	There is no problem trigger as this is a tsfindinode issue.
Workaround	Use mmapplypolicy to get file paths using the following procedure: Create a policy rule file: RULE LIST 'paths' DIRECTORIES_PLUS WHERE INODE=inum1 OR INODE=inum2 OR ... Then, run the policy scan using the preceding rule: mmapplypolicy -f /tmp -P policy.rule -I defer -m 64 View the result in /tmp/list.files

5.1.0.2

tsfindinode

IJ30138

logAssertFailed: ofP->isInodeValid() at mnUpdateInode when doing stat() or gpfs_statlite()

(show details)

Symptom	Abend/Crash
Environment	All
Trigger	A file is actively written on one node and stat() or gpfs_statlite() are repeatedly called on the file on another node.
Workaround	None

5.1.0.2

Core GPFS

IJ30139

mmbackup could backup files unnecessary after failure.

(show details)

Symptom	Performance Impact/Degradation
Environment	All
Trigger	mmbackup failing during IBM Spectrum Protect client and server backup phase
Workaround	None

5.1.0.2

mmbackup

IJ30141

Inodes not getting freed after user deleted them

(show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	A lot of inodes are deleted on a node which overflow background deletion queues.
Workaround	mmfsadm dump deferreddeletions mmfsadm test imapWork fs fullDeletion mmfsadm dump deferreddeletions

5.1.0.2

Core GPFS

IJ30142

mmvdisk --replace command results in message: Location XXX contains multiple disk devices.

(show details)

Symptom	Error output/message
Environment	Linux
Trigger	The problem occurs when hardware errors cause the lower logical block addresses on a disk to become unreadable.
Workaround	Contact IBM support for a workaround disk replacement procedure, or upgrade to a code level with the fix.

5.1.0.2

GNR, ESS

IJ30163

Memory leak on file system manager node during quota revoke storm

(show details)

Symptom	Cluster/File System Outage
Environment	All
Trigger	Quota usage reaches limit setting
Workaround	Restart GPFS.

5.1.0.2

Quotas

IJ30166

mmsmb exportacl list doesn't show "@" of the SMB share name.

(show details)

Symptom	Error output/messages
Environment	Linux
Trigger	Using the '@' character in SMB share name
Workaround	Do not use the '@' character in SMB share names.

5.1.0.2

SMB

IJ30131

Some processes may not be woken up as they should during a cluster manager change. That might lead to potential deadlocks.

(show details)

Symptom	Long Waiters
Environment	All
Trigger	Cluster manager change
Workaround	None

5.1.0.2

Core GPFS

IJ30180

Kernel v4.7 changes the inode ACLs cache mechanism, and GPFS (5.0.5.2+, 4.2.3.23+) does not adapt to the new kernel behaviors. The following two typical issues are observed: 1. normal user can access one file, and root removes the file access privilege from the user by chmod command => the user can still access the file 2. normal user cannot access one file, and root grants the file access privilege for the user by chmod command => the user cannot access the file either.

(show details)

Symptom	Unexpected Results/Behavior
Environment	Linux (Kernel v4.7+)
Trigger	GPFS does not adapt to the new kernel behavior.
Workaround	Remount the file system.

5.1.0.2

Core GPFS

IJ30191

When file audit logging is enabled, events that are generated on non-file system manager nodes will fail to be logged to the audit log.

(show details)

Symptom	Component Level Outage
Environment	Linux
Trigger	Enabling file audit logging on 510PTF1 version.
Workaround	None

5.1.0.2

File audit logging

IJ30224

The mmfs.log shows several "sdrServ: Communication error" messages.

(show details)

Symptom	Error output/messages
Environment	All
Trigger	Running the systemhealth monitor
Workaround	Restart the systemhealth monitor (mmsysmoncontrol restart).

5.1.0.2

System health

IJ30248

mmfsd crashed due to signal 11 when verifying the file system descriptor.

(show details)

Symptom	Abend/Crash
Environment	All
Trigger	Descriptor verification
Workaround	None

5.1.0.2

Core GPFS

IJ30337

This problem involves adding, deleting, or changing quorum nodes when GPFS is not running on the majority of quorum nodes. Commands may be left behind a mmRunningCommand lock which prevents GPFS from starting up.

(show details)

Symptom	Error output/message Unexpected Results/Behavior
Environment	All
Trigger	This problem affects mmaddnode, mmdelnode, and mmchnode commands when adding, deleting, or changing quorum nodes while the GPFS daemon is not running on the majority of the quorum nodes in the cluster.
Workaround	If GPFS is not started after running mmaddnode, mmdelnode, or mmchnode command, check the mmfs.log.latest. If a message "GPFS is waiting for commandName" appears in the mmfs.log.latest, confirm that the listed command is not running on any node then use the following command to free the mmRunningCommand lock: # mmcommon freelocks mmRunningCommand

5.1.0.2

Admin commands

IJ30337

The IBM Spectrum Scale HDFS Transparency connector version 3.1.0-6 contains 2 NullPointerExceptions in the HDFS NameNode service. The application accessing the data is not impacted, but these exceptions are seen in the NameNode log file.

(show details)

Symptom	Error output/message
Environment	All
Trigger	Running HDFS Transparency workload.
Workaround	None

5.1.0.2

HDFS Connector

IJ30336

The IBM Spectrum Scale HDFS Transparency connector version 3.1.0-6 modified the label for the open operation when the configuration is set to "Scale" for the ranger.enabled parameter. When retrieving the JMX stats, the open is reported as GetBlockLocations.

(show details)

Symptom	Open counts are reported under the label GetBlockLocations
Environment	All
Trigger	Retrieving JMX stats when ranger.enabled = scale
Workaround	Use GetBlockLocation values instead of Open.

5.1.0.2

HDFS Connector

IJ29025

Given an IBM Spectrum Scale cluster with 'verbsRdmaCm' set to 'enable' and configured to use RDMA through RoCE, individual nodes will fail to establish an RDMA connection to other nodes when the IP addresses configured on the RDMA adapters include a non-link local IPv6 address. (show details)

Symptom	Performance Impact/Degradation
Environment	Linux
Trigger	An IBM Spectrum Scale cluster with 'verbsRdmaCm' set to 'enable' configured to use RDMA through RoCE and RDMA adapters configured with non-link local IPv6 addresses.
Workaround	Remove all non-link local IPv6 addresses from all RDMA adapters which are configured to be used by IBM Spectrum Scale.

5.1.0.1

RDMA

IJ29026

Under heavy workload (especially file creation/deletion involved) with quota function enabled, some race issues are exposed such that the filesetId is not handled correctly, causing a GPFS daemon assert. (show details)

Symptom	Cluster/File System Outage
Environment	All
Trigger	Heavy workload with file creation/deletion involved
Workaround	None

5.1.0.1

Quotas

IJ29027

If the cluster is configured with separate daemon and admin interfaces, the -Y output of mmgetstate only shows the admin node name.

(show details)

Symptom	Command output
Environment	All
Trigger	Run mmgetstate -Y in cluster with a separate daemon and admin interface.
Workaround	Get the daemon node name from other commands such as mmlscluster.

5.1.0.1

Admin commands

IJ29154

mmfsd daemon assert going off: Assert exp(rmsgP != __null) in file llcomm.C, resulting in a daemon crash. (show details)

Symptom	Abend/Crash
Environment	Linux (protocol nodes)
Trigger	The RPC component may hit this assert because of missing a memory barrier.
Workaround	None

5.1.0.1

Core GPFS

IJ29155

While a node is tryiing to join a cluster, mmfsd start could encounter a null pointer reference and crash with a signal 11 with a backstack that looks like this:

[D] #0: 0x0000559601506BCE RGMaster::getNode FullDomainName(NodeAddr, char**) + 0xAE at ??:0 [D] #1: 0x000055960150CAA2 RGMaster::rgListServers(int, unsigned int) + 0x212 at ??:0 [D] #2: 0x000055960145F21C runTSLsRecoveryGroupV2 (int, StripeGroup*, int, char**) + 0xA8C at ??:0 [D] #3: 0x0000559601460371 runTSLsRecoveryGroup (int, StripeGroup*, int, char**) + 0xB1 at ??:0

(show details)

Symptom	Daemon crashes with signal 11
Environment	All
Trigger	A node trying to join a cluster
Workaround	None

5.1.0.1

GNR

IJ29157

logAssertFailed: !"Cleanup hit contended Fileset lock." (show details)

Symptom	Abend/Crash
Environment	All
Trigger	File system specific sync, or fileset cleanup actions like unlink fileset, unmount the filesystem on a given node.
Workaround	None

5.1.0.1

Filesets

IJ29028

mmlsmount with the --report and -Y options may not take into account nodes which do not have the file system mounted.

(show details)

Symptom	Unexpected results
Environment	All
Trigger	File system is mounted on some nodes and not on others.
Workaround	None

5.1.0.1

Core GPFS

IJ29045

mmhealth cluster show: faster heartbeat_missing

(show details)

Symptom	Network outage
Environment	Linux, AIX
Trigger	Network outage
Workaround	Running "mmhealth config interval high" would have a similar effect on heartbeat events but at a cost of a lot of system resources.

5.1.0.1

System health

IJ29161

The mmkeyserv command displays the latest expiration date from the KMIPT certificate chain. It should display the expiration date of the end-entity certificate.

(show details)

Symptom	Error output/message
Environment	Linux, AIX
Trigger	mmkeyserv command "show" output
Workaround	Use openssl to display and view each certificate in the chain to determine the correct expiration date.

5.1.0.1

Core GPFS, Admin commands, Encryption

IJ29162

The systemhealth monitor reports data and name nodes as down for the HadoopConService. In fact, both were running.

(show details)

Symptom	Unexpected Results/Behavior
Environment	Nodes with HadoopConnector installed
Trigger	Running the systemhealth monitor and HadoopConServices.
Workaround	None

5.1.0.1

System health

IJ29163

When truncating a migrated immutable file with DMAPI interfaces, the data of the file becomes zero, although the file is immutable. (show details)

Symptom	Data of the file becomes zero.
Environment	All
Trigger	Truncate operation against migrated immutable file.
Workaround	None

5.1.0.1

Immutable and append-only files

IJ29064

When file audit logging or watch folder is enabled on a file system, unmounting the file system might result in a waiter that will not clear. This may cause other commands to hang. (show details)

Symptom	A persistent waiter that causes commands to hang.
Environment	Linux
Trigger	Unmounting the file system that has file audit logging or watch folder enabled.
Workaround	Recycle the daemon.

5.1.0.1

File audit logging, Watch folder

IJ29186

On clusters having minReleaseLevel at 5.0.1, with mixed version nodes available from 5.0.1.x through till 5.0.5.x nodes and where the gateway node is at level 5.0.5.x - the newer level of gateway nodes finds it hard to co-exist with the older level nodes causing a recovery failure repeatedly. (show details)

Symptom	Unexpected Behavior
Environment	Linux(AFM gateway nodes)
Trigger	Triggering AFM fileset recovery on a mixed version cluster with gateway node at the 5.0.5.x level and minReleaseLevel on the cluster is at the 5.0.1 level.
Workaround	None

5.1.0.1

AFM

IJ28848

Cannot create SMB share using utf8 chars through CLI.

(show details)

Symptom	The mmsmb share name is not created and the command fails.
Environment	Linux
Trigger	A user issuing the mmsmb create command on a non-protocol node.
Workaround	Create the SMB share name (using utf8 character set) on a protocol node directly.

5.1.0.1

SMB

IJ29134

The systemhealth monitor reported a gpfs_down event and triggered a failover even though the system was fine. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	Unknown
Workaround	None

5.1.0.1

System health

IJ29135

mmhealth does not work on AIX. (show details)

Symptom	Component level outage
Environment	AIX (Power only)
Trigger	All new Spectrum Scale installations on AIX 7.1 and AIX 7.2 are generally affected in a major way (Sysmonitor cannot startup). Upgraded installations have some degraded functions.
Workaround	1. Run 'touch mmsysmon.json' 2. Run 'mmccr fput mmsysmon.json mmsysmon.json' 3. Run 'mmdsh -N all mmsysmoncontrol restart' After this workaround Sysmonitor will be able to startup, but some of its features (e.g. hiding events) will stay broken.

5.1.0.1

System health

IJ29002

If the default replication (-m or -r) setting for a file system is set to 1, and mmvdisk is used to add an additional vdisk set to the file system, an exception will be hit if the --failure-groups option is not used. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	File systems using vdisk based NSDs which have either data or metadata replication set to 1.
Workaround	In some cases, using the --failure-groups option avoids this issue.

5.1.0.1

ESS, GNR

IJ29210

mmvdisk recovery group fails when creating log vdisks when creating a new recovery group in a cluster with preexisting recovery groups. An error message "Disk XXX is already registered for use by GPFS" will appear on the command console, and the recovery group creation will fail. Once the problem condition is hit, IBM support must be contacted to correct the conflicting cluster information. (show details)

Symptom	Upgrade/Installation failure
Environment	Linux
Trigger	Users who have used mmchrecoverygroup --servers or mmvdisk recoverygroup to change --primary --backup on a preexisting recovery groups in a cluster may be affected when adding a new recovery group to that cluster.
Workaround	Avoid using the "--servers" option in mmchrecoverygroup or the "--primary" and "--backup" options in mmvdisk recovery group change to invoke planned temporary recovery group reassignment for maintenance operations. Instead, use the "--active" option in mmvdisk or mmchrecoverygroup to specify temporary recovery group reassignment.

5.1.0.1

Admin commands, ESS, GNR

IJ29213

While reading the file, the file can be evicted and its captured checksum shows inconsistency for this opened file.

(show details)

Symptom	File gets evicted even though the file is opened for operations.
Environment	Linux
Trigger	Read/write file and run manual evict on the same file.
Workaround	None

5.1.0.1

AFM

IJ29214

If a hostname is resolved to a loopback address, the CCR component might run into an assertion of type 'ccrNodeId == iter->id' when a node becomes quorum node using the mmchnode or mmaddnode command. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	When a node is added as a quorum node and the hostname is resolved to a loopback address.
Workaround	None

5.1.0.1

Admin commands, CCR

IJ29216

After mmimgrestore, the mmfsd could assert when handling the mmlsfileset command for a dependent fileset: logAssertFailed: fsOfP->getDirLayoutP() != __null (show details)

Symptom	Operation failure due to file system corruption
Environment	All
Trigger	mmimgrestore
Workaround	None

5.1.0.1

DMAPI, HSM, TSM

IJ29248

IBM Spectrum Scale has core dump triggered in dAssignSharedBufferSpace() due to segmentation fault hit by mmfsd or lxtrace daemon.

(show details)

Symptom	Daemon crash
Environment	Linux
Trigger	Enabling overwrite tracing and using internal lxtrace command to recycle the trace data.
Workaround	User should not directly use lxtrace command, instead you should use mmtrace or mmtracectl command to recycle the trace data.

5.1.0.1

Trace

IJ29188

When the file system is in panic on a quota client node, the outstanding quota share is not relinquished. Quota share Indoubt value is reported and the shares can only be reclaimed by mmcheckquota.

(show details)

Symptom	Unexpected Results
Environment	All
Trigger	File system is in panic
Workaround	None

5.1.0.1

Quotas

IJ29190

Incorrect quota check results on small files with fragments

(show details)

Symptom	Unexpected Result
Environment	All
Trigger	Newly created file contains fragments during mmcheckquota execution
Workaround	None

5.1.0.1

Quotas

IJ29201

Incorrect quota check result due to OpenFile reuse/updateShadowTab

(show details)

Symptom	Unexpected Result
Environment	All
Trigger	mmcheckquota
Workaround	None

5.1.0.1

Quotas

IJ29217

--metadata-only option hit the assert Assert exp(!"Assert on Structure Error") in prefetch. (show details)

Symptom	Assert
Environment	Linux
Trigger	Using the --metadata-only option with a prefetch list-file.
Workaround	None

5.1.0.1

AFM, AFM DR

IJ29275

Running prefetch stats is failing with err 22.

(show details)

Symptom	Prefetch stats fail with err 22 (Invalid Argument)
Environment	Linux
Trigger	Prefetch stats are triggered on inactive cache fileset state.
Workaround	None

5.1.0.1

AFM

IJ29243

File system manager could assert with exp(isStoragePoolIdValid(poolId)) during log recovery if a node fails shortly after running mmdeldisk. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	Node failure shortly after deleting a disk from the file system using mmdeldisk.
Workaround	None

5.1.0.1

Core GPFS

IJ29251

On a platform with NUMA support, GPFS may report no platform NUMA support is available.

(show details)

Symptom	Error message
Environment	All
Trigger	None
Workaround	None

5.1.0.1

Linux NUMA subsystem detection of system resources

IJ29252

Installation, update or configuration of the object protocol on 5.1.0.0 fails with a message saying the configuration is not supported or that required dependencies cannot be found. (show details)

Symptom	Upgrade/Installation failure
Environment	Linux
Trigger	Installing or configuring the object protocol in 5.1.0.0.
Workaround	Environments which need the Object protocol will need to stay on a pre-5.1.0.0 version.

5.1.0.1

Object

IJ29253

In cases which has small pagepool size and large file system block size, GPFS may wait for page reservation unnecessarily because GPFS tends to reserve more pages than necessary. (show details)

Symptom	Hang/Deadlock/Unresponsiveness/Long Waiters
Environment	Linux
Trigger	Workloads which access a file both in mmap and regular read/write method
Workaround	None

5.1.0.1

mmap

IJ29255

While the GPFS daemon is shutting down, there is chance a specific trace will be logged and it may crash the kernel. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	mmshutdown when trace is enabled and there is still a workload which accesses the GPFS file system.
Workaround	Stop workload before mmshutdown, or disable mmtrace before mmshutdown.

5.1.0.1

Core GPFS

IJ29261

On zLinux, while running a mmap workload with traceIoData configuration enabled, the trace code may trigger a page fault and cause the kernel to crash. (show details)

Symptom	Abend/Crash
Environment	Linux (s390x only)
Trigger	On zLinux, while running mmap workload with traceIoData configuration enabled, the trace code may trigger page fault and cause kernel crash.
Workaround	Do not enable traceIoData.

5.1.0.1

mmap

IJ29263

mmchnode fails when more than the current number of quorum nodes becomes quorum nodes again. (show details)

Symptom	Error output/message
Environment	All
Trigger	Command 'mmchnode --quorum -N' with number of new quorum nodes greater than the number of current quorum nodes.
Workaround	Restart the GPFS daemon on the nodes which should become quorum nodes before attempting the 'mmchnode --quorum ...' command.

5.1.0.1

mmchnode --quorum, CCR

IJ29356

GPFS maintains EA (Extended Attribute) registry to verify the EA priority. Due to incorrect EA registry addition without SG format version check, policy and inode scans might fail in the mixed node cluster environment. This problem could occur while running policy or inode scans in a mixed node environment running with 5.0.5.2, 5.0.5.3, and 5.1.0.0 and other old version nodes as the file system manager. (show details)

Symptom	Unexpected results
Environment	All
Trigger	Running policy or inode scans in a mixed node cluster environment running with 5.0.5.2, 5.0.5.3, and 5.1.0.0 along with older version nodes as the file system manager.
Workaround	None

5.1.0.1

AFM, Core GPFS

IJ29377

When file audit logging is enabled and audit events are being generated, and a file system is panicked, the IBM Spectrum Scale node where the panic happened may assert. (show details)

Symptom	Abend/Crash
Environment	Linux
Trigger	Panic on a file system with file audit logging enabled
Workaround	None

5.1.0.1

File audit logging

IJ29428

When an uncached file is renamed in the local-updates mode, the file is not copied to the previous snapshot causing the setInodeDirtyAndVerify assert. (show details)

Symptom	Crash
Environment	All
Trigger	Rename of a uncached file in local-updates mode fileset
Workaround	None

5.1.0.1

AFM

IJ29417

If a CCR backup archive is used to restore the CCR component with the command 'mmsdrrestore -F -a', the CCR component might run into an assertion 'myIdFromFile < 0 || iter->id == myIdFromFile' or 'ccrAddr == iter->addr.addr'. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	'mmsdrrestore -F -a' to restore a cluster from a CCR backup file.
Workaround	To avoid conflicting ccr.nodes files in /var/mmfs/tmp, make sure that files starting with ccr.nodes are deleted in this directory before attempting 'mmsdrrestore -F BACKUP_FILE -a'.

5.1.0.1

Admin commands, CCR

IJ29676

The GPFS daemon on a file system manager node could fail with logAssertFailed: nFailedNodes <= 64. This could happen on a large cluster where more than 64 nodes fail around the same time. (show details)

Symptom	Abend/Crash
Environment	All
Trigger	More than 64 nodes fail around the same time.
Workaround	Down grade the manager nodes to the 5.0.x release.

5.1.0.1

Core GPFS

IJ30425

mmvdisk throws an exception for a list operation when the daemon node name is not identical to the admin node name. (show details)

Symptom	Error output/message
Environment	Linux
Trigger	Daemon node name customized and doesn't match admin node name.
Workaround	None

5.1.0.1

GNR, ESS

IJ23984

If the directory passed to the --directory option has spaces or any special characters in the name, then prefetch is not able to handle them correctly. And, it fails printing the usage error to exit. (show details)

Symptom	Error output/message
Environment	Linux, AIX
Trigger	Pass directory names with special character in their name to the directory prefetch option.
Workaround	Prefetch needs to be done with a list file option when the directory prefetch is not able to help.

5.1.0.0

AFM

IJ25712

Due to a delayed file close in the VFS layer and the context mismatch, closing the file after the replication does not wait for the file system quiesce causing the remote log assert. (show details)

Symptom	Assert/Crash
Environment	Linux
Trigger	AFM replication using NSD backend with file system quiesce
Workaround	Prefetch needs to be done with a list file option when the directory prefetch is not able to help.

5.1.0.0

AFM, AFM DR

IJ25754

Quota clients request quota shares based on the workload and most of the time the quota shares given to an active client is much larger than the previously pre-defined amount (e.g. 20 file system blocks). The unused or excess quota shares are returned to the quota manager periodically. At the quota manager side, when the quota usage exceeds the established soft quota limits, the grace period is triggered. When this event occurs, the quota shares are reclaimed and the quota share distribution falls back to a more conservative fashion (based on a predetermined amount). In certain workloads, when the partial quota shares are returned to the manager along with the usage updates and as a result it triggers the soft quota limit exceeded event, some amount of quota shares are lost due to mismanagement of quota shares between the client and the manager. This leads to permanent loss of quota shares correctable by the mmcheckquota command. (show details)

Symptom	Quota shares loss, thus increasing the in-doubt values, caused by the soft quota exceeded events. The loss of shares cannot be reclaimed without running the mmcheckquota command.
Environment	All
Trigger	Timing and workload specific caused by when the quota usage exceeds the soft limits
Workaround	None. Establishing soft quota limits such that the usage is less likely to trigger repeatedly the "soft quota limit exceeded" events minimizes the timing window of hitting this issue.

5.1.0.0

Quotas

IJ25802

There is no automatic method to generate the GPL installable package for customized Red Hat Enterprise Linux releases. (show details)

Symptom	NA
Environment	Linux
Trigger	NA
Workaround	Follow /usr/lpp/mmfs/src/README steps.

5.1.0.0

Core GPFS

IJ25532

The snapshot deletion command fails with error 784(E_ENC_KEY_UNMARSHAL). This is because one of the snapshots file encryption attribute is corrupted. (show details)

Symptom	Unable to delete snapshot
Environment	All
Trigger	A bad encryption attribute exists in the snapshot file; then attempt to delete that snapshot.
Workaround	None

5.1.0.0

Snapshots, Encryption

IJ26654

The message csm_resync_needed should only show up if the communication between a node and the cluster manager was broken for a given amount of time. In mixed version environments, the events "csm_resync_needed" and "heartbeat_missing" messages may be shown erroneously. (show details)

Symptom	Unexpected Results/Behavior
Environment	All
Trigger	If a node is on a newer software release than the cluster manager, some events send out to build the cluster health cannot be handled when they report components which are not known to the cluster manager node. This will trigger retries which will fail as well as finally triggering the events "csm_resync_needed" and "heartbeat_missing". A "mmhealth node show --resync" will not help but would just re-create load on the network.
Workaround	None, other than using the same software version throughout the cluster.

5.1.0.0

System health

IJ26520

mmhealth event hide might take a long time to be included in the state as some checks are done very infrequently.

(show details)

Symptom	Unexpected Results/Behavior
Environment	Linux
Trigger	Command "mmhealth event hide event" is used.
Workaround	The command "mmhealth node show --refresh" can be used to force a refresh of the state.

5.1.0.0

System health

IJ27923

When a user turns off the file system maintenance mode, the file system cannot be mounted.

(show details)

Symptom	Cannot mount the file system
Environment	All
Trigger	Perform some file system operations while the maintenance mode is being turned off.
Workaround	Stop the attempts of file system operations while the maintenance mode is being turned off.

5.1.0.0

Core GPFS

IJ28607

afmFastCreate when set doesn't filter create messages already in the queue when remove comes in. As a result, both a create and a remove on same file can exist on the queue at the same time. If a link gets sandwiched between the create and the remove then the link fails to find the remote file and sees it as a conflict and drops the queue. Causing resync + recovery later to complete the sync. (show details)

Symptom	Unexpected Behavior
Environment	Linux
Trigger	Run a lot of temporary file generating workloads on a SW/DR Primary fileset with afmFastCreate on. During the temporary files getting generated and removed, a link should be created as well.
Workaround	None

5.1.0.0

AFM, AFM DR

IJ27087

Application runs with I/O priority mapping into a not supported QoS class, which does have IOPS limitation with 1 IOPS. Thus, leading to I/Os being queued to wait for enough tokens to service the I/O operation. This causes long waiters. (show details)

Symptom	I/O hang
Environment	All
Trigger	Application runs with lower I/O priority when QoS is being used.
Workaround	None

5.1.0.0

QoS

IJ28608

If call home data collection process was interrupted because of a the power loss, the following data collection of the same schedule will fail due to the directory already existing. (show details)

Symptom	Component Level Outage
Environment	Linux
Trigger	Power loss or restart of the Sysmonitor (mmsysmon.py) during a call home data collection
Workaround	Use the command: mmdsh -N all rm -rf /var/mmfs/tmp/gather//

5.1.0.0

Call home