One of the key features of the NDU open-source IoT platform is data collection, and this feature must work reliably under high load. In this article, we describe the steps and improvements we have made to ensure that a single instance of the NDU server can constantly handle 20,000+ devices and 30,000+ MQTT publish messages per second, which in total gives us around 2 million published messages per minute.
NDU performance leverages three main projects: Netty for the high-performance MQTT transport, Akka for the actor system that processes device messages, and Cassandra for storing time-series data.
We also use Zookeeper for coordination and gRPC in cluster mode. See platform architecture for more details.
IoT devices connect to the NDU server via MQTT and issue "publish" commands with a JSON payload. The size of a single publish message is approximately 100 bytes. MQTT is a lightweight publish/subscribe messaging protocol that offers a number of advantages over the HTTP request/response protocol.
The NDU server processes MQTT publish messages and stores them to Cassandra asynchronously. The server may also push data to WebSocket subscriptions from the Web UI dashboards (if present). We try to avoid any blocking operations, and this is critical for overall system performance. NDU supports MQTT QoS level 1, which means that a client receives a response to the publish message only after the data is stored to the Cassandra DB. Duplicate messages, which are possible with QoS level 1, simply overwrite the corresponding Cassandra row and thus are not present in the persisted data. This functionality provides reliable data delivery and persistence.
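To make the device side of this setup concrete, below is a minimal sketch of what such a publish looks like, assuming the Eclipse Paho MQTT client; the broker address, access token, and telemetry topic are placeholders rather than NDU specifics.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class DevicePublishSketch {

    public static void main(String[] args) throws MqttException {
        // Placeholder broker address and client id.
        MqttClient client = new MqttClient("tcp://localhost:1883", "device-001", new MemoryPersistence());

        MqttConnectOptions options = new MqttConnectOptions();
        options.setUserName("DEVICE_ACCESS_TOKEN"); // token-based auth as MQTT username is an assumption here

        client.connect(options);

        // Small JSON payload, comparable in size to the ~100-byte messages used in the tests.
        String payload = "{\"temperature\":23.4,\"humidity\":57,\"ts\":1510000000000}";
        MqttMessage message = new MqttMessage(payload.getBytes());
        message.setQos(1); // QoS 1: the server acknowledges only after the data is persisted

        // The telemetry topic name is a placeholder, not necessarily the one NDU uses.
        client.publish("v1/devices/me/telemetry", message);

        client.disconnect();
    }
}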
We used the Gatling load-testing framework, which is also based on Akka and Netty. Gatling is able to simulate 10K MQTT clients using only 5-10% of a 2-core CPU. See our separate article about how we improved the unofficial Gatling MQTT plugin to support our use case.
The results of the first performance tests on a modern 4-core laptop with an SSD were quite poor. The platform was able to process only 200 messages per second. The root cause and main performance bottleneck were quite obvious and easy to find: the processing was not 100% asynchronous, and we were executing a blocking API call of the Cassandra driver inside the Telemetry Service actor. A quick refactoring of the service implementation resulted in a more than 10X performance improvement, and we reached approximately 2,500 published messages per second from 1,000 devices. We would also like to recommend this article about async queries to Cassandra.
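For illustration, here is a minimal sketch of the non-blocking pattern this refactoring moves towards, assuming the DataStax Java driver 3.x and Guava; the class and method names are hypothetical and do not reflect the actual NDU code.

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.MoreExecutors;

public class AsyncWriteSketch {

    private final Session session;

    public AsyncWriteSketch(Session session) {
        this.session = session;
    }

    // Illustrative method: submits the write and returns immediately,
    // so the calling actor thread is never blocked waiting for Cassandra.
    public void saveTelemetryAsync(BoundStatement insert) {
        ResultSetFuture future = session.executeAsync(insert);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                // e.g. acknowledge the QoS 1 publish back to the device here
            }

            @Override
            public void onFailure(Throwable t) {
                // log and/or retry; still no blocking on the calling thread
            }
        }, MoreExecutors.directExecutor());
    }
}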
We decided to move to AWS EC2 instances to be able to share both the results and the tests we executed. We started running tests on a c4.xlarge instance (4 vCPUs and 7.5 GB of RAM) with the Cassandra and NDU services co-located.
Test specification:
First test results were obviously unacceptable:
The huge response time above was caused by the fact that the server was simply not able to process 10K messages per second, so they were getting queued.
We started our investigation by monitoring memory and CPU load on the test instance. Our initial guess was that the poor performance was caused by heavy load on the CPU or RAM. But in fact, during load testing, we saw that at particular moments the CPU was idle for a couple of seconds. This 'pause' event was happening every 3-7 seconds, see the chart below.
As a next step, we decided to take thread dumps during these pauses. We expected to see blocked threads that could give us a clue about what was happening during the pauses. So we opened a separate console to monitor CPU load and another one to take thread dumps while performing the stress tests, using the following command:
kill -3 THINGSBOARD_PID
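The kill -3 command sends the SIGQUIT signal to the JVM, which prints a full thread dump to the process's standard output without stopping it.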
We identified that during each pause there was always one thread in TIMED_WAITING state, and the root cause was the awaitAvailableConnection method of the Cassandra driver:
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
parking to wait for <0x0000000092d9d390> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
at com.datastax.driver.core.HostConnectionPool.awaitAvailableConnection(HostConnectionPool.java:287)
at com.datastax.driver.core.HostConnectionPool.waitForConnection(HostConnectionPool.java:328)
at com.datastax.driver.core.HostConnectionPool.borrowConnection(HostConnectionPool.java:251)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:301)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:281)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:91)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
at org.thingsboard.server.dao.AbstractDao.executeAsync(AbstractDao.java:91)
at org.thingsboard.server.dao.AbstractDao.executeAsyncWrite(AbstractDao.java:75)
at org.thingsboard.server.dao.timeseries.BaseTimeseriesDao.savePartition(BaseTimeseriesDao.java:135)
As a result, we realized that the default connection pool configuration of the Cassandra driver was causing poor results in our use case.
The official documentation for the connection pooling feature contains a special option, 'Simultaneous requests per connection', that allows you to tune the number of concurrent requests per single connection. We use Cassandra driver protocol v3, which by default allows 1,024 simultaneous requests per connection for LOCAL hosts and 256 for REMOTE hosts.
Considering the fact that we are actually receiving data from 10,000 devices, the default values are definitely not enough. So we changed the code and updated the values for LOCAL and REMOTE hosts, setting them to the maximum possible values:
poolingOptions
.setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 32768);
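For context, below is a minimal sketch of how such pooling options can be applied when the driver's Cluster is built (DataStax Java driver 3.x); the contact point and keyspace are placeholders.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.HostDistance;
import com.datastax.driver.core.PoolingOptions;
import com.datastax.driver.core.Session;

public class ClusterConfigSketch {

    public static Session connect() {
        // Raise the per-connection request limit for both local and remote hosts.
        PoolingOptions poolingOptions = new PoolingOptions()
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
                .setMaxRequestsPerConnection(HostDistance.REMOTE, 32768);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")      // placeholder contact point
                .withPoolingOptions(poolingOptions)
                .build();

        return cluster.connect("thingsboard");     // placeholder keyspace
    }
}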
Test results after the applied changes are listed below.
The results were much better, but still far from even 1 million messages per minute. We no longer saw pauses in CPU load during our tests on c4.xlarge. CPU load was high (80-95%) during the entire test. We took a couple of thread dumps to verify that the Cassandra driver was no longer awaiting available connections, and indeed we did not see this issue anymore.
We decided to run the same tests on a twice as powerful node, c4.2xlarge, with 8 vCPUs and 15 GB of RAM. The performance increase was not linear, and the CPU was still heavily loaded (80-90%).
We noticed a significant improvement in response time. After a significant peak at the start of the test, the maximum response time was within 200 ms and the average response time was ~50 ms. The number of requests per second was around 10K.
We also executed the test on c4.4xlarge with 16 vCPUs and 30 GB of RAM, but did not notice significant improvements, so we decided to separate the NDU server and move Cassandra to a three-node cluster.
Our main goal was to identify how many MQTT messages a single NDU server running on c4.2xlarge can handle. We will cover horizontal scalability of an NDU cluster in a separate article. So we decided to move Cassandra to three separate c4.xlarge instances with the default configuration and launch the Gatling stress-test tool from two separate c4.xlarge instances simultaneously, to minimize the possible third-party effect on latency and throughput.
Test specification:
The statistics of the two simultaneous test runs launched on different client machines are listed below.
Based on the data from the two simultaneous test runs, we reached 30,000 published messages per second, which is equal to 1.8 million per minute.
We have prepared several AWS AMIs for anyone who is interested in replicating these tests. See the separate documentation page with detailed instructions.
This performance test demonstrates how a small NDU cluster, which costs approximately $1 per hour, can easily receive, store, and visualize more than 100 million messages from your devices. We will continue our work on performance improvements and are going to publish performance results for a cluster of NDU servers in our next blog post. We hope this article will be useful for people who are evaluating the platform and want to execute performance tests on their own. We also hope that the performance improvement steps will be useful for engineers who use similar technologies.
Please let us know your feedback and follow our project on Github or Twitter.