
Re: Sporadic high IO bandwidth and Linux OOM killer

On Thu, Dec 6, 2018 at 3:39 PM Riccardo Ferrari <ferrarir@xxxxxxxxx> wrote:
To be honest, I've never seen the OOM killer in action on those instances. My Xmx was 8GB, just like yours, which makes me think you have some other process competing for memory, is that right? Do you have any cron job, backup, or anything else that could trip the OOM killer?


As I've mentioned previously, apart from Docker running Cassandra on the JVM, there is only a small number of housekeeping processes: cron to trigger log rotation, a log-shipping agent, a node metrics exporter (Prometheus), and a few other small things.  None of these come anywhere close to Cassandra's memory requirements, and they routinely show very low memory usage in atop and similar tools.  Their overhead seems minimal.

My unresponsiveness lasted seconds. This is/was bad because the gossip protocol went crazy marking nodes down, with all the consequences that can have in a distributed system: think of hints, the dynamic snitch, and whatever else depends on node availability...
Can you share some numbers from `tpstats`, or about your system load in general?

Here's some pretty typical tpstats output from one of the nodes:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     0         0      319319724         0                 0
ViewMutationStage                 0         0              0         0                 0
ReadStage                         0         0       80006984         0                 0
RequestResponseStage              0         0      258548356         0                 0
ReadRepairStage                   0         0        2707455         0                 0
CounterMutationStage              0         0              0         0                 0
MiscStage                         0         0              0         0                 0
CompactionExecutor                1        55        1552918         0                 0
MemtableReclaimMemory             0         0           4042         0                 0
PendingRangeCalculator            0         0            111         0                 0
GossipStage                       0         0        6343859         0                 0
SecondaryIndexManagement          0         0              0         0                 0
HintsDispatcher                   0         0            226         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlush                 0         0           4046         0                 0
ValidationExecutor                1         1           1510         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0           4042         0                 0
InternalResponseStage             0         0           5890         0                 0
AntiEntropyStage                  0         0           5532         0                 0
CacheCleanupExecutor              0         0              0         0                 0
Repair#250                        1         1              1         0                 0
Native-Transport-Requests         2         0      260447405         0                18

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     1
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

Speaking of CPU utilization, it is consistently within 30-60% on all nodes (and even less at night).

No rollbacks, just moving forward! Right now we are upgrading the instance size to something more recent than m1.xlarge (for many different reasons, including security, ECU and network). Nevertheless, it might be a good idea to upgrade to the 3.x branch to take advantage of its better off-heap memory management.

One thing we have noticed very recently is that our nodes are indeed running low on memory.  It now even seems that the IO is a side effect of the impending OOM, not the other way around as we had initially thought.

After a fresh JVM start the memory allocation looks roughly like this:

             total       used       free     shared    buffers     cached
Mem:           14G        14G       173M       1.1M        12M       3.2G
-/+ buffers/cache:        11G       3.4G
Swap:           0B         0B         0B

Then, over a number of days, the page cache shrinks all the way down to unreasonably small numbers, like only 150M.  At the same time "free" stays at its original level, while "used" grows all the way up to 14G.  Shortly after that the node becomes unavailable because of the IO, and ultimately, after some time, the JVM gets killed.
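For reference, the "used" figure here is in the free(1) sense: MemTotal minus MemFree, Buffers and Cached.  A minimal sketch of how we track it from /proc/meminfo (the field names are the standard kernel keys; the sample values below are made up to roughly match the numbers above):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return fields

def used_kb(m):
    """'used' in the free(1) sense: total minus free, buffers and cache."""
    return m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]

# Illustrative values only; on a node you would read /proc/meminfo instead.
sample = """\
MemTotal:       15360000 kB
MemFree:          177000 kB
Buffers:           12000 kB
Cached:           153600 kB
"""
m = parse_meminfo(sample)
print("used: %d kB, cached: %d kB" % (used_kb(m), m["Cached"]))
```

Logging these two numbers periodically makes the "cache drains while free stays flat" pattern easy to see over the course of days.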

Most importantly, the resident size of the JVM process stays at around 11-12G the whole time, like it was shortly after the start.  How can we find out where the rest of the memory gets allocated?  Is it just some sort of malloc fragmentation?
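One rough check (a sketch assuming Linux /proc; it only counts userspace RSS) is to sum the resident size of every process and compare the total against the kernel's "used" figure.  If the sum of all RSS is far below "used", the missing memory is being held outside any process, e.g. in kernel slab or page tables, rather than by the JVM itself:

```python
import glob
import os

PAGE_KB = os.sysconf("SC_PAGE_SIZE") // 1024  # usually 4 kB pages

def total_rss_kb():
    """Sum the RSS of all processes from /proc/<pid>/statm (2nd field)."""
    total = 0
    for path in glob.glob("/proc/[0-9]*/statm"):
        try:
            with open(path) as f:
                total += int(f.read().split()[1]) * PAGE_KB
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were iterating
    return total

print("sum of process RSS: %d kB" % total_rss_kb())
```

Comparing this against the used_kb figure from free/meminfo should tell us whether the growth is attributable to processes at all.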

As we are running a relatively recent version of the JDK, we've tried the option -Djdk.nio.maxCachedBufferSize=262144 on one of the nodes, as suggested in this issue:
But we didn't see any improvement.  Also, the expectation is that if this were the issue in the first place, the resident size of the JVM process would grow at the same rate as available memory shrinks, correct?
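To test that expectation, one could sample the JVM's VmRSS alongside the cache level over time; a flat RSS while the cache drains would argue against the NIO buffer cache theory.  A small sketch of the parsing side (the format is the standard /proc/<pid>/status layout; the sample text is illustrative):

```python
def vm_rss_kb(status_text):
    """Extract VmRSS in kB from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads and zombies have no VmRSS line

# Illustrative snippet; a real status file has many more fields.
sample_status = "Name:\tjava\nVmRSS:\t11534336 kB\nThreads:\t213\n"
print(vm_rss_kb(sample_status))
```

Feeding this the JVM's pid every few minutes, together with the Cached value from /proc/meminfo, would give the two trend lines to compare.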

Another thing we haven't found an answer to so far is why, within the JVM, heap.used (<= 6GB) never reaches heap.committed = 8GB.  Any ideas?