Load test Cassandra – The native way – Part 2: The How

In the previous post, we talked briefly about why we need to load test a service. In this post, we'll pick up where we left off and look at how to load test Cassandra. So, let's get started with this amazingly powerful tool: cassandra-stress.


Things to keep in mind:

  • A load test should simulate the real production scenario, so it is very important to have a setup as close to the one in production as possible. It is highly recommended to use a separate node/host in proximity to the cluster for load testing (e.g., deploy the load test server in the same region if your deployment is in AWS).
  • Do not use any node from the cluster itself for load testing. Since cassandra-stress comes bundled with the Cassandra distribution, it is tempting to run the tool directly on one of the nodes. However, cassandra-stress is a heavyweight process that can consume a lot of JVM resources and, in turn, cloud your node's performance.
  • Keep in mind that cassandra-stress is not a distributed program, so in order to test a cluster, we need to make sure memory on the load-generating host is not a bottleneck. I would recommend a host with at least 16 GB of memory.

How to use cassandra-stress:

Step 1: The configuration file

The configuration file is how we tell cassandra-stress to prepare the keyspace, the table, and the data for the load test. We need to configure a handful of properties defining the keyspace, the table, the data distribution for the test, and the queries to test.

  • keyspace – Keyspace name
  • keyspace_definition – CQL that defines the keyspace
  • table – Table name
  • table_definition – CQL that defines the table
  • columnspec – Column distribution specifications
  • insert – Batch ratio distribution specifications
  • queries – A list of queries you wish to run against the schema
# Keyspace name
keyspace: keyspace_to_load_test

# The CQL for creating a keyspace (optional if it already exists)
keyspace_definition: |
  CREATE KEYSPACE keyspace_to_load_test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}

# Table name
table: table_to_load_test

# The CQL for creating a table you wish to stress (optional if it already exists)
table_definition: |
  CREATE TABLE table_to_load_test (
    id uuid,
    column1 text,
    column2 int,
    PRIMARY KEY ((id), column1)
  )

### Column distribution specifications ###
columnspec:
  - name: id
    population: gaussian(1..1000000, 500000, 15) # Normal distribution to mimic the production load
  - name: column1
    size: uniform(5..20)  # Anywhere from 5 to 20 characters
    cluster: fixed(5)     # Assuming 5 distinct clustering values (carriers) per partition
  - name: column2
    population: uniform(100..500) # Values anywhere from 100 to 500

### Batch ratio distribution specifications ###
insert:
  partitions: fixed(1)  # Each insert touches a single partition
  select: fixed(1)/5    # Each batch covers 1/5th of the rows in the partition
  batchtype: UNLOGGED   # Unlogged batches

#
# A list of queries you wish to run against the schema
#
queries:
  queryForUseCase:
    cql: select * from table_to_load_test where id = ? and column1 = ?
    fields: samerow

Now that we have the configuration file ready, we can use it to run our load test with the cassandra-stress tool. Let's see how to run the tool.


Step 2: Command options

The cassandra-stress tool comes bundled with your Cassandra distribution download; you will find it in apache-cassandra-<version>/tools/bin/. You can also explore the available options in more depth through the tool's help option. I will go through an example and show you how to run the tool in this post.
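For instance, you can print the general usage or drill into a particular option (a minimal sketch; it assumes your working directory is the distribution's tools/bin, and -rate is just one example of an option to inspect):

./cassandra-stress help
./cassandra-stress help -rate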

cassandra-stress user profile=stresstest.yaml duration=4h 'ops(insert=100, queryForUseCase=1)' cl=LOCAL_QUORUM -node <node list separated by commas> -rate threads=450 throttle=30000/s -graph file="stress-result-4h-ratelimit-clients.html" title=Stress-test-4h -log file=result.log

Let's go over the options I used, one by one, to understand what they mean. This is by no means a comprehensive explanation; I would highly recommend giving the documentation a good read to learn more about these options.

  • user – Tells cassandra-stress to run the load test against a user-specified schema.
  • profile – Where the configuration (YAML) file exists.
  • duration – How long your load test should run.
  • ops – Operations defined in the YAML file to include in the load test; in our example, insert and queryForUseCase from the YAML file.
  • cl – Consistency level for your operations.
  • node – Nodes in the cluster.
  • rate – Number of threads and peak ops/sec limit.
  • graph – Graphical report of the run; specify the file name and title of the report.
  • log – Log file name.

It is as simple as this. The tool will now run for the duration specified and output a detailed report on the run.
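As a variation, here is a sketch of a write-only run that pre-loads the table with a fixed number of operations instead of running for a duration (the node IPs and thread count here are placeholders; n= is the total operation count):

cassandra-stress user profile=stresstest.yaml n=1000000 'ops(insert=1)' cl=LOCAL_QUORUM -node 10.0.0.1,10.0.0.2 -rate threads=200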

I hope you found this helpful, and I would certainly be delighted to answer any questions about it.


Load test Cassandra – The native way – Part 1: The Why

Load testing is an imperative part of the software development process. The idea is to test a feature/service in a prod-like environment under a realistically high load for an extended time frame, just to gain confidence that the service will not bail out on us during critical times. Quite logical, isn't it? In this series, I'll go through my brief experience load testing a schema in Cassandra. So let's get started right away!

With micro-service architecture being the norm at almost every turn in software development, it is worth spending time talking about how to load test a micro-service. Is it going to be different from testing a monolithic service? Since we said we test a near prod-like setup, does it mean I have to spend a whole lot and set up the same number of nodes as the prod cluster? And what if I have some kind of auto-scaling set up? These were a few questions I had when I had to load test a micro-service. The answer is quite simple mathematics: extrapolation. We simulate the load against one node and then extrapolate the result. This, however, may not be perfectly accurate, as a few things may be left out of the equation, like network bandwidth, disk I/O, etc. It is also essential to load test the load balancer to get a clear picture.

But wait! The above method works fine as long as each service has just one responsibility. How about load testing a scenario where the architecture is supposed to perform only as part of a cluster? What do we do if these processes talk to each other and gossip among themselves? There are many big-data architectures like this, and one such service is Cassandra. Fortunately, there is a tool that comes bundled with Cassandra for this very purpose: cassandra-stress.

cassandra-stress was initially developed as an internal tool by the developers of Cassandra to load test Cassandra's internals. Later, a mode was added to the tool to let Cassandra users test their own schemas.

I definitely wouldn't claim that cassandra-stress is the only way to achieve this. In fact, load testing Cassandra was possible well before this tool was generally available. My online research suggested the next most popular public choice was a JMeter plugin. I chose cassandra-stress for the obvious reason that it's a native tool that comes bundled with Cassandra and has a pretty easy learning curve.

Let’s go over how to configure your own load test using the cassandra-stress tool in another post.


Cassandra – A shift from SQL

We have had a lot of paradigm shifts in how we think about persisting data and retrieving it as efficiently as possible. For the longest time, we have had SQL as our go-to solution for persisting data. SQL definitely is an awesome solution, which is why it has survived the test of time for so long (fun fact: the first commercial RDBMS was Oracle, released in 1979 [1]), and it is definitely not going away in the near future.

In my opinion, SQL gave us all a near-perfect, generic solution for persisting data. SQL gives a lot of importance to things such as data integrity, atomicity, and normalization, so the data is always in a consistent state while maintaining query performance. Hence, it has its own machinery for sorting things out: joins, foreign keys, etc. Of course, life is not always fair: this magic comes at the price of scale. SQL datastores are often not horizontally scalable and require manual, application-level logic to shard the data and decrease the load. One other big challenge with SQL is the single point of failure, which again comes down to its inability to scale horizontally.

We were taught to think about database design this way, and hence that is what we do best. But the majority of applications do not care about the size of the data or how it is stored. On the other hand, we don't actually mind if the data is replicated in multiple locations. In fact, storage has become so cheap that we would love to have data duplication wherever possible.

With big data intruding into almost every domain, it does not always make sense to hold on to the SQL way of doing things. I'm in no way suggesting that SQL does not have a future, or that everyone who depends on big data needs to resort to the NoSQL way of doing things. The idea is to prioritize the feature set you need your datastore to have, and to be open about picking accordingly. Keep in mind that using NoSQL and SQL hand in hand is not considered a bad practice at all; just make sure you are not over-engineering your use case.

There are multiple options in the market right now, but we are going to talk about one such datastore: Cassandra. Why Cassandra? Because that is the one I have the most insight into. Cassandra has now reached a very mature state, with many big shots, such as Netflix, Apple, and SoundCloud, using it as their datastore. So, what is Cassandra all about? Cassandra is a very powerful, battle-tested datastore that provides high availability with high read and write throughput.

I did mention that letting go of a few features that relational databases provide can give you benefits such as the ability to scale horizontally, increased availability, etc. So, what is the compromise we have to make if we choose Cassandra? It is the data model.

The data model is the best lever for fine-tuning your Cassandra cluster. Relational databases deal with creating logical entities called tables, with relationships among the tables expressed through foreign keys. Since Cassandra does not have any kind of joins or relationships, the data has to be stored in a denormalized fashion. This is actually not a bad thing in Cassandra, but the catch is that we need to know the query access patterns before designing the model. If done properly, the performance we get out of Cassandra is phenomenal. Here is a link that talks about benchmarking a Cassandra cluster at Netflix.
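To make the query-first idea concrete, here is a minimal sketch (the table and column names are made up for this example): suppose the access pattern is "fetch the latest posts of a given user." We design a table around exactly that query, duplicating data into other tables if other queries need a different layout.

CREATE TABLE posts_by_user (
  user_id uuid,
  post_time timestamp,
  post_id uuid,
  title text,
  PRIMARY KEY ((user_id), post_time, post_id)
) WITH CLUSTERING ORDER BY (post_time DESC, post_id ASC);

-- The partition key (user_id) sends the query to a single partition,
-- and the clustering order hands back the newest posts first:
SELECT title FROM posts_by_user WHERE user_id = ? LIMIT 10;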

I would like to get this going as a series, so I will stop talking about cassandra now and we’ll start off with how to model your data in cassandra in a different post.



Spring-batch useful job-repository queries

One confusing thing when starting with a spring-batch project is figuring out how to query the job repository database to understand what is happening with a job. The data model is designed to suit spring-batch's need to keep track of the jobs it manages, but when someone wants to query a job or view job statuses, they may have to jump across multiple tables or write a query joining several of them to figure out what is going on. So I decided to use this space to document the queries I use frequently when inspecting the job repository. I will keep updating this space as I find something interesting or useful.

The queries are pretty straightforward and can be easily derived if you properly understand the job repository model described in the web page here.

PS: My queries may not be the best, but they work for me, and I am happy to receive feedback on improving them. So let's get started with the queries.

Query #1: List all jobs, with the job name and the parameters of each job.

Often we want the job name along with the parameters it was started with, when it started, and so on. This query is handy when I want to find all the jobs in the job repository, most recent first.

SELECT
    je.JOB_EXECUTION_ID, je.JOB_INSTANCE_ID, ji.JOB_NAME, je.START_TIME, je.END_TIME, je.STATUS, bjep.*, je.EXIT_MESSAGE
FROM
    BATCH_JOB_EXECUTION je
    INNER JOIN BATCH_JOB_INSTANCE ji ON je.JOB_INSTANCE_ID = ji.JOB_INSTANCE_ID
    INNER JOIN BATCH_JOB_EXECUTION_PARAMS bjep ON je.JOB_EXECUTION_ID = bjep.JOB_EXECUTION_ID
ORDER BY je.JOB_EXECUTION_ID DESC;


Query #2: List all the steps and their statuses for a job, by job name

Next comes the step information. Finding out which steps are executing, what the status of each step is, which job a step is associated with, and which run it is bound to are questions that often require jumps across multiple tables. So this query always comes in handy when I have such questions.


SELECT
    bse.STEP_EXECUTION_ID, bse.JOB_EXECUTION_ID, ji.JOB_NAME, bse.STEP_NAME, bse.START_TIME, bse.END_TIME, bse.COMMIT_COUNT, bse.READ_COUNT, bse.WRITE_COUNT, bse.STATUS, bse.EXIT_MESSAGE, bse.LAST_UPDATED
FROM
    BATCH_STEP_EXECUTION bse
    INNER JOIN BATCH_JOB_EXECUTION je ON bse.JOB_EXECUTION_ID = je.JOB_EXECUTION_ID
    INNER JOIN BATCH_JOB_INSTANCE ji ON je.JOB_INSTANCE_ID = ji.JOB_INSTANCE_ID
ORDER BY bse.JOB_EXECUTION_ID DESC;

So far, these are the two queries I use most often to find out more about the jobs. There are a few more that I will add once I feel they solve a common enough problem, but if there is something you would like to add, feel free to leave a comment and I will update the post.
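In the same spirit, here is one more sketch: pulling the most recent execution of a particular job ('myJob' is a placeholder job name, and the LIMIT syntax assumes MySQL):

SELECT je.JOB_EXECUTION_ID, ji.JOB_NAME, je.STATUS, je.START_TIME, je.END_TIME, je.EXIT_MESSAGE
FROM
    BATCH_JOB_EXECUTION je
    INNER JOIN BATCH_JOB_INSTANCE ji ON je.JOB_INSTANCE_ID = ji.JOB_INSTANCE_ID
WHERE ji.JOB_NAME = 'myJob'
ORDER BY je.START_TIME DESC
LIMIT 1;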
