Re: [EXTERNAL] Writes and Reads with high latency


- How many event_datetime records can you have per pkey? 
During a working day I can have fewer than 10 event_datetime records per pkey.
I keep at most 3 of them at a time, so each new event_datetime for a pkey triggers a delete and an insert in Cassandra.

- How many pkeys (roughly) do you have?
A few million, but the number is going to grow.


- In general, you only want to have at most 100 MB of data per partition (pkey). If it is larger than that, I would expect some timeouts. I suspect you either have very wide rows or lots of tombstones

I ran some nodetool commands in order to give you more data:

CFSTATS output:

nodetool cfstats my_keyspace.my_table -H
Total number of tables: 52
----------------
Keyspace : my_keyspace
Read Count: 2441795
Read Latency: 400.53986035478 ms
Write Count: 5097368
Write Latency: 6.494159368913525 ms
Pending Flushes: 0
Table: my_table
SSTable count: 13
Space used (live): 185.45 GiB
Space used (total): 185.45 GiB
Space used by snapshots (total): 0 bytes
Off heap memory used (total): 80.66 MiB
SSTable Compression Ratio: 0.2973552755387901
Number of partitions (estimate): 762039
Memtable cell count: 915
Memtable data size: 43.75 MiB
Memtable off heap memory used: 0 bytes
Memtable switch count: 598
Local read count: 2441795
Local read latency: 93.186 ms
Local write count: 5097368
Local write latency: 3.189 ms
Pending flushes: 0
Percent repaired: 0.0
Bloom filter false positives: 5719
Bloom filter false ratio: 0.00000
Bloom filter space used: 1.65 MiB
Bloom filter off heap memory used: 1.65 MiB
Index summary off heap memory used: 1.17 MiB
Compression metadata off heap memory used: 77.83 MiB
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 20924300
Compacted partition mean bytes: 529420
Average live cells per slice (last five minutes): 2.0
Maximum live cells per slice (last five minutes): 3
Average tombstones per slice (last five minutes): 7.423841059602649
Maximum tombstones per slice (last five minutes): 50
Dropped Mutations: 0 bytes

----------------
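
As a rough check of the 100 MB guideline against the figures above, here is a minimal Python sketch. The CFSTATS string is copied from the output above; parse_cfstats and PARTITION_LIMIT_BYTES are names made up for this illustration, and 100 MB is taken as 100,000,000 bytes (an assumption, since the guideline is only approximate).

```python
# Guideline quoted in this thread: at most ~100 MB per partition (assumed 10^8 bytes).
PARTITION_LIMIT_BYTES = 100 * 1000 * 1000

# Lines copied from the cfstats output above.
CFSTATS = """\
Compacted partition minimum bytes: 104
Compacted partition maximum bytes: 20924300
Compacted partition mean bytes: 529420
Average tombstones per slice (last five minutes): 7.423841059602649
Maximum tombstones per slice (last five minutes): 50
"""


def parse_cfstats(text):
    """Turn 'name: value' lines into a dict of floats (hypothetical helper)."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep:
            stats[name.strip()] = float(value)
    return stats


stats = parse_cfstats(CFSTATS)
max_bytes = stats["Compacted partition maximum bytes"]
max_fraction = max_bytes / PARTITION_LIMIT_BYTES
max_tombstones = stats["Maximum tombstones per slice (last five minutes)"]

print(f"largest partition: {max_bytes / 1e6:.1f} MB "
      f"({max_fraction:.0%} of the guideline)")
print(f"max tombstones per slice: {max_tombstones:.0f}")
```

On these numbers the largest compacted partition is about 21% of the guideline, and the maximum of 50 tombstones per slice is far below Cassandra's default tombstone_warn_threshold of 1000, which would be consistent with the absence of tombstone warnings in the log.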

CFHISTOGRAMS output:

nodetool cfhistograms my_keyspace my_table
my_keyspace/my_table histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%            10.00            379.02           1955.67            379022                 8
75%            12.00            654.95         186563.16            654949                17
95%            12.00          20924.30         268650.95           1629722                35
98%            12.00          20924.30         322381.14           2346799                42
99%            12.00          20924.30         386857.37           3379391                50
Min             0.00              6.87             88.15               104                 0
Max            12.00          25109.16         464228.84          20924300               179
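
The latencies above are in microseconds, which makes the jump between percentiles easy to miss. A small Python sketch, using only the Read Latency column copied from the table above, converts them to milliseconds; the variable names are illustrative.

```python
# Read latency percentiles in microseconds, copied from the cfhistograms output above.
read_latency_us = {
    "50%": 1955.67,
    "75%": 186563.16,
    "95%": 268650.95,
    "98%": 322381.14,
    "99%": 386857.37,
    "Max": 464228.84,
}

# Convert to milliseconds for easier reading.
read_latency_ms = {pct: us / 1000 for pct, us in read_latency_us.items()}

for pct, ms in read_latency_ms.items():
    print(f"{pct:>4} {ms:8.1f} ms")

# How much slower the 75th percentile is than the median.
jump = read_latency_ms["75%"] / read_latency_ms["50%"]
print(f"75% is about {jump:.0f}x the median")
```

The median read takes about 2 ms, but a quarter of reads take roughly 187 ms or more, so the slowness is not limited to a few outliers at the 99th percentile.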

I enabled tracing ('tracing on') in cqlsh and ran some queries to find out whether tombstones are scanned frequently,
but in my small sample of queries I got nearly identical traces, like the following:

Preparing statement [Native-Transport-Requests-1] 
Executing single-partition query on my_table [ReadStage-2] 
Acquiring sstable references [ReadStage-2] 
Bloom filter allows skipping sstable 2581 [ReadStage-2] 
Bloom filter allows skipping sstable 2580 [ReadStage-2] 
Bloom filter allows skipping sstable 2575 [ReadStage-2] 
Partition index with 2 entries found for sstable 2570 [ReadStage-2] 
Bloom filter allows skipping sstable 2548 [ReadStage-2] 
Bloom filter allows skipping sstable 2463 [ReadStage-2] 
Bloom filter allows skipping sstable 2416 [ReadStage-2] 
Partition index with 3 entries found for sstable 2354 [ReadStage-2] 
Bloom filter allows skipping sstable 1784 [ReadStage-2] 
Partition index with 5 entries found for sstable 1296 [ReadStage-2] 
Partition index with 3 entries found for sstable 1002 [ReadStage-2] 
Partition index with 3 entries found for sstable 372 [ReadStage-2] 
Skipped 0/12 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2] 
Merged data from memtables and 5 sstables [ReadStage-2] 
Read 3 live rows and 0 tombstone cells [ReadStage-2]
Request complete
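
To summarize a trace like the one above without counting lines by hand, a minimal Python sketch can tally the SSTables skipped by the bloom filter, the SSTables actually read, and the tombstone cells reported. The TRACE string is copied from the trace above; the regular expressions only match the message wording shown there and are not a general trace parser.

```python
import re

# Trace lines copied from the cqlsh output above.
TRACE = """\
Bloom filter allows skipping sstable 2581 [ReadStage-2]
Bloom filter allows skipping sstable 2580 [ReadStage-2]
Bloom filter allows skipping sstable 2575 [ReadStage-2]
Partition index with 2 entries found for sstable 2570 [ReadStage-2]
Bloom filter allows skipping sstable 2548 [ReadStage-2]
Bloom filter allows skipping sstable 2463 [ReadStage-2]
Bloom filter allows skipping sstable 2416 [ReadStage-2]
Partition index with 3 entries found for sstable 2354 [ReadStage-2]
Bloom filter allows skipping sstable 1784 [ReadStage-2]
Partition index with 5 entries found for sstable 1296 [ReadStage-2]
Partition index with 3 entries found for sstable 1002 [ReadStage-2]
Partition index with 3 entries found for sstable 372 [ReadStage-2]
Read 3 live rows and 0 tombstone cells [ReadStage-2]
"""

# SSTables the bloom filter ruled out, and SSTables whose index was consulted.
skipped = re.findall(r"Bloom filter allows skipping sstable (\d+)", TRACE)
read = re.findall(r"Partition index with \d+ entries found for sstable (\d+)", TRACE)

# Live rows and tombstone cells reported at the end of the read.
match = re.search(r"Read (\d+) live rows and (\d+) tombstone cells", TRACE)
live_rows, tombstone_cells = int(match.group(1)), int(match.group(2))

print(f"skipped {len(skipped)} sstables, read {len(read)} sstables")
print(f"{live_rows} live rows, {tombstone_cells} tombstone cells")
```

For this trace the query touched 5 of 12 SSTables to return 3 live rows and scanned no tombstone cells, which points at the number of SSTables per read rather than tombstones, at least for this sample.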


- Since you mention lots of deletes, I am thinking it could be tombstones. Are you getting any tombstone warnings or errors in your system.log?

For each pkey, every new event_datetime makes me delete one of the (at most) 3 records previously saved in Cassandra.
If a pkey doesn't exist in Cassandra yet, I store it with its event_datetime without deleting anything.

I don't see any tombstone warnings or errors in Cassandra's logs.


- When you delete, are you deleting a full partition?

Query for deletes:
delete from my_keyspace.my_table where pkey = ? and event_datetime = ? IF EXISTS;


-  [..] And because only one node has the data, a single timeout means you won’t get any data. 

I will try to increase RF from 1 to 3.


I hope I have answered all your questions.
Thank you very much!

Regards
Marco


On Thu, 27 Dec 2018 at 21:09, Durity, Sean R <SEAN_R_DURITY@xxxxxxxxxxxxx> wrote:

Your RF is only 1, so the data only exists on one node. This is not typically how Cassandra is used. If you need the high availability and low latency, you typically set RF to 3 per DC.
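
To make the replication point concrete, here is a minimal Python sketch of the arithmetic. It assumes the usual rule that QUORUM (and SERIAL, which is what a conditional statement such as IF EXISTS uses) needs a majority of replicas; the function names are made up for this illustration.

```python
def quorum(rf):
    """Replicas that must respond for QUORUM or SERIAL: a majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf, required_replicas):
    """Replicas of a partition that can be down while requests still succeed."""
    return rf - required_replicas


for rf in (1, 3):
    at_one = tolerated_failures(rf, 1)
    at_quorum = tolerated_failures(rf, quorum(rf))
    print(f"RF={rf}: ONE tolerates {at_one} replica(s) down, "
          f"QUORUM needs {quorum(rf)} and tolerates {at_quorum} down")
```

With RF=1 every consistency level needs the single replica that owns the partition, so one slow or unavailable node fails the request. With RF=3, a read or write at ONE or LOCAL_ONE can still succeed with two replicas down, and a QUORUM or SERIAL operation with one replica down.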

 

How many event_datetime records can you have per pkey? How many pkeys (roughly) do you have? In general, you only want to have at most 100 MB of data per partition (pkey). If it is larger than that, I would expect some timeouts. And because only one node has the data, a single timeout means you won’t get any data. Server timeouts for reads and writes default to 10 seconds or less, depending on the request type. The secret to Cassandra is to always select your data by at least the partition key (which you are doing). So, I suspect you either have very wide rows or lots of tombstones.

 

Since you mention lots of deletes, I am thinking it could be tombstones. Are you getting any tombstone warnings or errors in your system.log? When you delete, are you deleting a full partition? If you are deleting just part of a partition over and over, I think you will be creating too many tombstones. I try to design my data partitions so that deletes are for a full partition. Then I won’t be reading through 1000s (or more) tombstones trying to find the live data.

 

 

Sean Durity

 

From: Marco Gasparini <marco.gasparini@xxxxxxxxxxxxxxx>
Sent: Thursday, December 27, 2018 3:01 AM
To: user@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [EXTERNAL] Writes and Reads with high latency

 

Hello Sean,

 

Here are my schema and RF:

 

-------------------------------------------------------------------------

CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1'} AND durable_writes = true;

CREATE TABLE my_keyspace.my_table (
    pkey text,
    event_datetime timestamp,
    agent text,
    ft text,
    ftt text,
    some_id bigint,
    PRIMARY KEY (pkey, event_datetime)
) WITH CLUSTERING ORDER BY (event_datetime DESC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 90000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

 

-------------------------------------------------------------------------  

 

The queries I run are very simple:

 

select pkey, event_datetime, ft, some_id, ftt from my_keyspace.my_table where pkey = ? limit ?;

and

insert into my_keyspace.my_table (event_datetime, pkey, agent, some_id, ft, ftt) values (?,?,?,?,?,?);

 

About the retry policy: yes. When a write fails I store it somewhere else and, after a while, I try to write it to Cassandra again. This way I can store almost all my data. When a read fails, however, I don't apply any retry policy (but that is my problem).
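
For the read side, one possible approach is a small application-level retry with exponential backoff. The Python sketch below is independent of any driver: retry, flaky_read, and the delay values are hypothetical names and numbers chosen for illustration, and the simulated timeout stands in for a real failed query.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Back off before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Simulated read that times out twice and then succeeds.
calls = []


def flaky_read():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated read timeout")
    return ["row1", "row2", "row3"]


# Record the delays instead of really sleeping, to keep the demo fast.
delays = []
rows = retry(flaky_read, sleep=delays.append)
print(rows, delays)
```

Retrying a plain read is safe because it has no side effects. The same wrapper should be used with more care for conditional writes such as IF EXISTS, where a timeout does not say whether the operation was applied.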

 

 

Thanks

Marco

 

 

On Fri, 21 Dec 2018 at 17:18, Durity, Sean R <SEAN_R_DURITY@xxxxxxxxxxxxx> wrote:

Can you provide the schema and the queries? What is the RF of the keyspace for the data? Are you using any Retry policy on your Cluster object?

 

 

Sean Durity

 

From: Marco Gasparini <marco.gasparini@xxxxxxxxxxxxxxx>
Sent: Friday, December 21, 2018 10:45 AM
To: user@xxxxxxxxxxxxxxxxxxxx
Subject: [EXTERNAL] Writes and Reads with high latency

 

Hello all,

 

I have 1 DC of 3 nodes running Cassandra 3.11.3 with consistency level ONE and Java 1.8.0_191.

 

Every day, many Node.js programs send data to the Cassandra cluster via the Node.js cassandra-driver.

Every day I get about 600k requests. For each request the server must:

1_ READ some data in Cassandra (by an id, usually I get 3 records),

2_ DELETE one of those records

3_ WRITE the data into Cassandra.

 

So every day I make many deletes.
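
A back-of-the-envelope estimate of how many tombstones this workload keeps around can be sketched in Python. It uses the 600k requests per day from this message and the gc_grace_seconds = 90000 setting from the schema posted in this thread, and it assumes one row delete per request, spread evenly over the day. Both are simplifications, and tombstones are only purged when compaction runs after gc_grace_seconds has passed, so the result is a lower bound.

```python
REQUESTS_PER_DAY = 600_000   # from the message above
GC_GRACE_SECONDS = 90_000    # from the table schema in this thread
SECONDS_PER_DAY = 86_400

# Assumption: exactly one row delete (one tombstone) per request.
tombstones_per_day = REQUESTS_PER_DAY

# Tombstones cannot be purged before gc_grace_seconds has elapsed.
gc_grace_hours = GC_GRACE_SECONDS / 3600
min_retained = tombstones_per_day * GC_GRACE_SECONDS / SECONDS_PER_DAY

print(f"gc_grace_seconds = {gc_grace_hours:.0f} hours")
print(f"at least {min_retained:,.0f} tombstones retained cluster-wide")
```

Under these assumptions at least 625,000 row tombstones exist across the cluster at any time. How many of them a single read has to scan depends on how they are spread across partitions and SSTables.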

 

Every day I find errors like:

"All host(s) tried for query failed. First host tried, 10.8.0.10:9042: Host considered as DOWN. See innerErrors...."

"Server timeout during write query at consistency LOCAL_ONE (0 peer(s) acknowledged the write over 1 required)...."

"Server timeout during write query at consistency SERIAL (0 peer(s) acknowledged the write over 1 required)...."

"Server timeout during read query at consistency LOCAL_ONE (0 peer(s) acknowledged the read over 1 required)...."

 

nodetool tablehistograms tells me this:

 

Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             8.00            379.02           1955.67            379022                 8
75%            10.00            785.94         155469.30            654949                17
95%            12.00          17436.92         268650.95           1629722                35
98%            12.00          25109.16         322381.14           2346799                42
99%            12.00          30130.99         386857.37           3379391                50
Min             0.00              6.87             88.15               104                 0
Max            12.00          43388.63         386857.37          20924300               179

 

At the 99th percentile, write and read latency are pretty high, but I don't know how to improve them.

I can provide more statistics if needed.

 

Is there any change I can make to Cassandra's configuration so that I don't lose any data?

 

Thanks 

 

Regards

Marco

 



The information in this Internet Email is confidential and may be legally privileged. It is intended solely for the addressee. Access to this Email by anyone else is unauthorized. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. When addressed to our clients any opinions or advice contained in this Email are subject to the terms and conditions expressed in any applicable governing The Home Depot terms of business or client engagement letter. The Home Depot disclaims all responsibility and liability for the accuracy and content of this attachment and for any damages or losses arising from any inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other items of a destructive nature, which may be contained in this attachment and shall not be liable for direct, indirect, consequential or special damages in connection with this e-mail message or its attachment.



