This article describes the most useful statistics for troubleshooting VMware GemFire deployments.
Each member of the VMware GemFire DistributedSystem produces a variety of statistics in several categories. If the statistic-sampling-enabled property is set to true, then the statistics are periodically written to an archive file configured by the statistic-archive-file property. The main way to view the file is to use the Visual Statistics Display (vsd) tool. See the product documentation for additional details on producing the statistics file and on vsd.
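As a minimal sketch of the configuration described above, the following Java snippet builds the two properties with plain java.util.Properties and prints them in properties-file form. The class name StatisticsConfig and the archive name member1.gfs are hypothetical; how the properties reach the member (for example through a gemfire.properties file) depends on the deployment.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper: collects the two statistics properties named in the article.
final class StatisticsConfig {

    /** Returns properties that enable sampling and name the archive file. */
    static Properties samplingProperties(String archiveFile) {
        Properties props = new Properties();
        props.setProperty("statistic-sampling-enabled", "true");
        props.setProperty("statistic-archive-file", archiveFile);
        return props;
    }

    public static void main(String[] args) throws IOException {
        // "member1.gfs" is an illustrative file name, not a required value.
        Properties props = samplingProperties("member1.gfs");
        StringWriter out = new StringWriter();
        props.store(out, "statistics sampling");
        System.out.print(out);
    }
}
```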
Some of these statistics are helpful in troubleshooting most issues; some are more obscure and only apply to narrow situations.
This article describes the statistics that are most useful when troubleshooting issues, and in some cases, relationships between the statistics.
All of the statistics are grouped into categories. The most useful categories are listed below. The most important statistics in each category are described in the following sections.
A VMStats instance groups together all the statistics related to the JVM process including:
fdLimit— indicates the maximum number of file descriptors available to the JVM, reported alongside the current number of open descriptors. Both values are retrieved from ManagementFactory.getOperatingSystemMXBean(). If the number of open file descriptors reaches the limit, then an exception with ‘Too many open files’ will occur.
processCpuTime— indicates the CPU time used by the JVM process, retrieved from ManagementFactory.getOperatingSystemMXBean(). This statistic shows how much of the total host CPU (see LinuxSystemStats) is accounted for by the JVM.
threads— indicates the number of threads in the JVM retrieved from the
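The JVM-level values above can be inspected directly through the platform MXBeans. The sketch below is a hypothetical helper, not GemFire code: it assumes a HotSpot-style JDK where the operating system bean implements the com.sun.management interfaces, and it reads the thread count from the ThreadMXBean, which is an assumption of this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: reads JVM-level values similar to the VMStats statistics.
final class JvmSnapshot {
    final long openFds;         // -1 when the platform does not expose it
    final long maxFds;          // -1 when the platform does not expose it
    final long processCpuNanos; // -1 when not supported
    final int threads;

    JvmSnapshot(long openFds, long maxFds, long processCpuNanos, int threads) {
        this.openFds = openFds;
        this.maxFds = maxFds;
        this.processCpuNanos = processCpuNanos;
        this.threads = threads;
    }

    static JvmSnapshot take() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        long open = -1;
        long max = -1;
        long cpu = -1;
        // File descriptor counts are only exposed by the Unix variant of the bean.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            open = unix.getOpenFileDescriptorCount();
            max = unix.getMaxFileDescriptorCount();
        }
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            cpu = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
        }
        // Assumption of this sketch: the thread count comes from the ThreadMXBean.
        int threadCount = ManagementFactory.getThreadMXBean().getThreadCount();
        return new JvmSnapshot(open, max, cpu, threadCount);
    }

    /** True when open descriptors have reached the given fraction of the limit. */
    boolean nearFdLimit(double fraction) {
        return openFds >= 0 && maxFds > 0 && openFds >= fraction * maxFds;
    }
}
```

Calling JvmSnapshot.take().nearFdLimit(0.9) gives an early warning before ‘Too many open files’ appears; the 0.9 fraction is an arbitrary illustrative choice.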
A VMMemoryPoolStats instance groups together all the statistics related to a Java heap memory space. Examples include CMS Old Gen, Par Eden Space, G1 Eden Space and G1 Old Gen. One is created for each of the
MemoryPoolMXBeans provided by
currentUsedMemory— indicates the amount of memory currently in use in this memory pool
currentMaxMemory— indicates the maximum amount of memory this memory pool can use
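To see which memory pools a given JVM exposes, and therefore which VMMemoryPoolStats instances to expect, the standard java.lang.management API can list them. This is a small sketch with a hypothetical class name; it filters to heap pools to match the description above and prints the used and maximum sizes reported by the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists heap memory pools with their used and max sizes.
final class HeapPoolReport {

    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined for the pool.
            lines.add(pool.getName() + ": used=" + usage.getUsed() + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeHeapPools()) {
            System.out.println(line);
        }
    }
}
```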
A VMGCStats instance groups together all the statistics related to a Java garbage collector. Examples include G1 Old Generation and G1 Young Generation. One is created for each of the
GarbageCollectorMXBeans provided by
collections— indicates the number of garbage collections
collectionTime— indicates the garbage collection time in milliseconds. Spikes in this statistic may cause members to be disconnected from the
DistributedSystem and may require garbage collection tuning or adjustments to the configured heap or region configuration (e.g. add or change heap LRU eviction).
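When the collection time is sampled as a running total, which is how GarbageCollectorMXBean.getCollectionTime() reports it, a spike shows up as an unusually large increase between two consecutive samples. The sketch below assumes such cumulative samples; the class name and the threshold are hypothetical, and the threshold is expressed in the same unit as the samples.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: finds spikes in a cumulative GC time counter.
final class GcSpikes {

    /** Per-interval increase of a cumulative counter. */
    static long[] deltas(long[] totals) {
        if (totals.length < 2) {
            return new long[0];
        }
        long[] result = new long[totals.length - 1];
        for (int i = 1; i < totals.length; i++) {
            result[i - 1] = totals[i] - totals[i - 1];
        }
        return result;
    }

    /** Index of the first interval whose increase exceeds the threshold, or -1. */
    static int firstSpike(long[] totals, long threshold) {
        long[] d = deltas(totals);
        for (int i = 0; i < d.length; i++) {
            if (d[i] > threshold) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Current totals for each collector in this JVM.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " collections=" + gc.getCollectionCount()
                + " collectionTime=" + gc.getCollectionTime());
        }
    }
}
```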
A StatSamplerStats instance groups together all the statistics related to statistic sampling.
delayDuration— indicates the delay between samples taken by the statistics sampler thread. The statThread samples statistics periodically based on the statistic-sample-rate property. If the statThread doesn’t sample when it should, the delayDuration will show a spike. This often indicates a resource issue (e.g. GC or CPU) and helps narrow the time frame for investigation.
jvmPauses— indicates the number of JVM pauses. This statistic is incremented when the delay between statistics samples is greater than three seconds. This time is configurable via the gemfire.statSamplerDelayThreshold java system property.
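The relationship between sample delays and the pause count can be sketched as follows. This is an illustration of the rule stated above, not the sampler's actual code: the class name is hypothetical, sample times are assumed to be in milliseconds, and the three-second default is passed in as a parameter.

```java
// Hypothetical helper: derives delays and a pause count from sample timestamps.
final class SamplerDelay {
    static final long DEFAULT_PAUSE_THRESHOLD_MS = 3000;

    /** Delay between each pair of consecutive samples, in milliseconds. */
    static long[] delays(long[] sampleTimesMs) {
        if (sampleTimesMs.length < 2) {
            return new long[0];
        }
        long[] result = new long[sampleTimesMs.length - 1];
        for (int i = 1; i < sampleTimesMs.length; i++) {
            result[i - 1] = sampleTimesMs[i] - sampleTimesMs[i - 1];
        }
        return result;
    }

    /** Number of delays that are greater than the threshold. */
    static int countPauses(long[] sampleTimesMs, long thresholdMs) {
        int pauses = 0;
        for (long delay : delays(sampleTimesMs)) {
            if (delay > thresholdMs) {
                pauses++;
            }
        }
        return pauses;
    }
}
```

With samples expected every second, a gap of 4.5 seconds between two samples counts as one pause, and its timestamp marks where to look in GC logs or CPU statistics.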
A ResourceManagerStats instance groups together all the statistics related to the monitoring of heap usage.
heapCriticalEvents— indicates the number of times the heap usage exceeded the critical heap percentage. The critical heap percentage is the percentage at which the member will accept no more Cache operations. It is configured via the
evictionStartEvents— indicates the number of times the heap usage exceeded the eviction heap percentage. The eviction heap percentage is the percentage at which eviction will begin for regions defined with heap LRU eviction. It is configured via the
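Both event counters above count how often heap usage rises past a configured percentage. A simplified model of that counting is sketched below; the class name and the 80 and 90 percent values used in the usage note are hypothetical, and the real resource manager may apply additional tolerance or smoothing that this sketch ignores.

```java
// Hypothetical helper: counts upward crossings of a heap usage percentage.
final class HeapThresholdEvents {

    /** Heap usage as a percentage of the maximum heap. */
    static double usedPercent(long usedBytes, long maxBytes) {
        return 100.0 * usedBytes / maxBytes;
    }

    /** Number of times usage goes from at-or-below the threshold to above it. */
    static int countCrossings(double[] usagePercent, double thresholdPercent) {
        int events = 0;
        boolean above = false;
        for (double pct : usagePercent) {
            boolean nowAbove = pct > thresholdPercent;
            if (nowAbove && !above) {
                events++; // a new excursion above the threshold starts here
            }
            above = nowAbove;
        }
        return events;
    }
}
```

For a usage series of 50, 82, 85, 79, 91, 95, 70, 81 percent, an eviction threshold of 80 yields three events and a critical threshold of 90 yields one.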
A PartitionedRegionStats instance groups together all the statistics related to a partitioned region.
bucketCount— indicates the number of buckets defined in the member
primaryBucketCount— indicates the number of primary buckets defined in the member
dataStoreBytesInUse— indicates the number of entry bytes across all the buckets including primaries and secondaries
dataStoreEntryCount— indicates the number of entries across all the buckets including primaries and secondaries
A LinuxSystemStats instance groups together all the statistics related to Linux system performance.
cachedMemory— indicates the amount of memory cached in RAM retrieved from
cpuActive— indicates the active CPU percentage retrieved from
freeMemory— indicates the amount of free memory available on the host machine retrieved from
/proc/meminfo. This statistic helps determine if the amount of available memory is adequate for the JVM heap plus native threads.
loadAverage15— indicates the number of running and waiting processes averaged over the last 15 minutes, retrieved from /proc/loadavg. The load average statistics help determine if the load on the system is too high for the number of CPUs.
physicalMemory— indicates the amount of physical memory on the host retrieved from
recvBytes— indicates the number of bytes received over the network from other members retrieved from
recvDrops— indicates the number of received packets dropped, retrieved from /proc/net/dev. Non-zero values for this statistic indicate possible network issues.
xmitBytes— indicates the number of bytes transmitted over the network to other members retrieved from
xmitDrops— indicates the number of transmitted packets dropped, retrieved from /proc/net/dev. Non-zero values for this statistic indicate possible network issues.
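The same /proc files can be checked by hand when a statistics archive is not available. The sketch below parses the text of /proc/loadavg and /proc/net/dev using the standard Linux layout of those files. The class name is hypothetical, and treating a load average above the CPU count as "too high" is a common rule of thumb assumed here, not a GemFire rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses the contents of /proc/loadavg and /proc/net/dev.
final class ProcParsers {

    /** Returns the 1, 5 and 15 minute load averages from /proc/loadavg text. */
    static double[] parseLoadAvg(String content) {
        String[] f = content.trim().split("\\s+");
        return new double[] {
            Double.parseDouble(f[0]), Double.parseDouble(f[1]), Double.parseDouble(f[2])
        };
    }

    /** Rule of thumb (an assumption): load above the CPU count is too high. */
    static boolean overloaded(double loadAverage, int cpus) {
        return loadAverage > cpus;
    }

    /** Maps interface name to {receive drops, transmit drops} from /proc/net/dev text. */
    static Map<String, long[]> parseNetDevDrops(String content) {
        Map<String, long[]> drops = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // the two header lines contain no "iface:" prefix
            }
            String[] f = line.substring(colon + 1).trim().split("\\s+");
            if (f.length < 12) {
                continue; // not a complete counter row
            }
            // Column 3 is the receive drop count, column 11 the transmit drop count.
            drops.put(line.substring(0, colon).trim(),
                new long[] {Long.parseLong(f[3]), Long.parseLong(f[11])});
        }
        return drops;
    }
}
```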
A DistributionStats instance groups all the statistics related to peer to peer communication and processing.
nodes— indicates the number of members of the
functionExecutionThreads and functionExecutionQueueSize— indicate the number of threads in the functionExecutionPool used to process Function execution requests and the size of the queue for excess requests when all the threads are in use. The functionExecutionThreads statistic corresponds to the number of Function Execution Processor threads (default maximum is the maximum of processors*16 and 100). If the functionExecutionQueueSize is consistently greater than zero, then the functionExecutionPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_FE_THREADS java system property. See my blog here for additional information on when and how Function execution threads are used.
highPriorityThreads and highPriorityQueueSize— indicate the number of threads in the highPriorityPool used to process high priority messages (e.g. RequestImageMessage) and the size of the queue for excess requests when all the threads are in use. The highPriorityThreads statistic corresponds to the number of Pooled High Priority Message Processor threads (default maximum is 1000). If the highPriorityQueueSize is consistently greater than zero, then the highPriorityPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_THREADS java system property.
partitionedRegionThreads and partitionedRegionQueueSize— indicate the number of threads in the partitionedRegionPool used to process partitioned region messages (e.g. DestroyMessage) and the size of the queue for excess requests when all the threads are in use. The partitionedRegionThreads statistic corresponds to the number of PartitionedRegion Message Processor threads (default maximum is the maximum of processors*32 and 200). If the partitionedRegionQueueSize is consistently greater than zero, then the partitionedRegionPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_PR_THREADS java system property.
processingThreads and overflowQueueSize— indicate the number of threads in the threadPool used to process normal messages (e.g. ManagerStartupMessage) and the size of the queue for excess requests when all the threads are in use. The processingThreads statistic corresponds to the number of Pooled Message Processor threads (default maximum is 1000). If the overflowQueueSize is consistently greater than zero, then the threadPool’s maximum number of threads can be increased by setting the DistributionManager.MAX_THREADS java system property.
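The default maximums quoted for the function execution and partitioned region pools depend on the number of processors, and "consistently greater than zero" is a judgment over a window of samples. The sketch below works through both; the class and method names are hypothetical, and requiring every sample in the window to be above zero is one possible reading of "consistently".

```java
// Hypothetical helper: default pool limits and a queue backlog check.
final class PoolLimits {

    /** Default maximum for the function execution pool: max(processors*16, 100). */
    static int defaultFunctionExecutionMax(int processors) {
        return Math.max(processors * 16, 100);
    }

    /** Default maximum for the partitioned region pool: max(processors*32, 200). */
    static int defaultPartitionedRegionMax(int processors) {
        return Math.max(processors * 32, 200);
    }

    /** True when the window is non-empty and every queue size sample is above zero. */
    static boolean consistentlyQueued(int[] queueSizeSamples) {
        if (queueSizeSamples.length == 0) {
            return false;
        }
        for (int size : queueSizeSamples) {
            if (size <= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int processors = Runtime.getRuntime().availableProcessors();
        System.out.println("function execution default max: " + defaultFunctionExecutionMax(processors));
        System.out.println("partitioned region default max: " + defaultPartitionedRegionMax(processors));
    }
}
```

On a host with 4 processors the defaults are 100 and 200; with 8 processors they become 128 and 256. Java system properties such as the ones named above are normally passed on the JVM command line with the -D flag.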
sendersTO— indicates the number of outgoing thread-owned (TO) connections to other members. This statistic will only be set with the conserve-sockets property set to false. In that case, when a thread processing a request in one member needs to send a message to another member, it will create and use a dedicated connection to that member. An example is when a ServerConnection thread processing a client put request needs to replicate the value to a secondary member. This will cause the remote member to create a dedicated P2P message reader thread to handle this message and any future messages from the local member and thread. This will increment the sendersTO statistic in the local member and the receiversTO statistic in the remote member.
receiversTO— indicates the number of incoming thread-owned (TO) connections from remote members. A corresponding sendersTO will be incremented in the remote member. This statistic corresponds to the number of P2P message reader threads and will only be set with the conserve-sockets property set to false.
senderTimeouts— indicates the number of outgoing thread-owned (TO) connections that have been idle for the socket-lease-time property (default is 60000 ms) and have been closed. When a thread-owned connection is closed, its corresponding remote P2P message reader thread will also be closed. The local sendersTO and the remote receiversTO statistics will be decremented. In addition, the local senderTimeouts will be incremented. The thread-owned connections between members are created on demand and can be costly to create (especially with SSL). Once they are established, they should be maintained as long as the thread that established them exists. Increasing socket-lease-time (maximum is 600000 ms) or deactivating it by setting it to zero will help ensure that connections are not closed prematurely.
replyTimeouts— indicates the number of times a thread in one member waited for at least ack-wait-threshold seconds (default=15) for a reply from another member. The thread will continue to wait even though the timeout has occurred until either the reply is received or the remote member leaves the DistributedSystem. This statistic corresponds to a 15 second warning message in the log.
replyWaitsInProgress— indicates the number of threads in one member waiting for a reply from a remote member. This statistic flatlined above zero indicates a permanently stuck thread.
suspectsReceived— indicates the number of suspect messages received from other members whenever a member departs unexpectedly or there are network issues such that a specific member cannot be contacted
suspectsSent— indicates the number of suspect messages sent to other members whenever a member departs unexpectedly or there are network issues such that a specific member cannot be contacted
A CacheServerStats instance groups all the statistics related to client to server communication and processing.
currentClients— indicates the number of unique clients that currently have a connection to this server. For long-lived clients, this statistic should be relatively flat.
currentClientConnections— indicates the total number of client connections to this server. This statistic indicates the number of client threads performing Cache operations.
closeConnectionRequests— indicates the number of close connection requests from clients. For long-lived clients, this statistic is an indicator of how often idle client connections are timed out and closed. This statistic also has a relationship with sendersTO and receiversTO. Churn in this statistic also means churn in those statistics. Churn in this case means socket connections from the client to the server and from that server to its peer members being closed and reopened. Since creating socket connections can be expensive (especially for SSL), this statistic should be as close to zero as possible. If there is a lot of churn in this statistic, then the client idle-timeout property should be increased or deactivated. The default is five seconds, which is often too low.
connectionsTimedOut— indicates the number of connections that the server determines have timed out on the client based on the read-timeout property. Even though the statistic is incremented, the ServerConnection thread processing the client request continues processing that request. This statistic should be as close to zero as possible. If not, then the read-timeout property should be increased.
threadQueueSize— indicates the number of client requests waiting for a ServerConnection thread to process them. It is only applicable if the max-threads property is set greater than zero. This property causes a pool to be created. If the threadQueueSize is consistently greater than zero, then the max-threads property should be increased.
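Churn in a counter such as closeConnectionRequests is easier to judge as a rate than as a raw total. The following sketch computes the average increase per second between the first and last sample. The class name is hypothetical, and the statistic is assumed to be a running total sampled with millisecond timestamps.

```java
// Hypothetical helper: average per-second rate of a cumulative counter.
final class CounterRate {

    /** Average increase per second between the first and last sample. */
    static double perSecond(long[] timesMs, long[] values) {
        if (timesMs.length != values.length || timesMs.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        long elapsedMs = timesMs[timesMs.length - 1] - timesMs[0];
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("timestamps must increase");
        }
        long increase = values[values.length - 1] - values[0];
        return increase * 1000.0 / elapsedMs;
    }
}
```

A counter that grows from 10 to 130 over one minute gives a rate of 2 closes per second, which for long-lived clients suggests idle connections are being timed out and recreated continuously.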
A CachePerfStats instance groups all the statistics related to Cache usage.
cacheListenerCallsInProgress— indicates the number of CacheListener callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheListener.
cacheWriterCallsInProgress— indicates the number of CacheWriter callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheWriter.
loadsInProgress— indicates the number of CacheLoader callbacks in progress. This statistic flatlined above zero indicates a permanently stuck CacheLoader.
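A flatline above zero, as described for these in-progress statistics, can be detected mechanically from a series of samples. The sketch below is a heuristic with hypothetical names: it measures how long the most recent value has stayed unchanged above zero. A busy member can legitimately hold a steady non-zero value for a short time, so the minimum run length is left to the caller.

```java
// Hypothetical helper: detects an in-progress gauge that is stuck above zero.
final class StuckDetector {

    /** Length of the trailing run of samples equal to the last value, if that value is above zero. */
    static int trailingFlatlineAboveZero(int[] samples) {
        if (samples.length == 0) {
            return 0;
        }
        int last = samples[samples.length - 1];
        if (last <= 0) {
            return 0; // the gauge returned to zero, nothing is in progress
        }
        int run = 0;
        for (int i = samples.length - 1; i >= 0 && samples[i] == last; i--) {
            run++;
        }
        return run;
    }

    /** True when the gauge has been flat above zero for at least minSamples samples. */
    static boolean looksStuck(int[] samples, int minSamples) {
        return trailingFlatlineAboveZero(samples) >= minSamples;
    }
}
```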
This article has shown some of the more useful statistics used when troubleshooting issues.