How to install KairosDB time series database?

In my previous post, I described why I switched from OpenTSDB (another time series database) to KairosDB. In this post, I will show how to install and run KairosDB.

Requirements

To run KairosDB we actually just need KairosDB itself (apart from Java and a Linux distribution such as Ubuntu or Debian). How is that possible? Well, KairosDB supports two datastores: H2 and Cassandra. The H2 datastore is an in-memory H2 database. It’s easy to set up and clean up, and it’s mostly used for development. Don’t use it in production; it will work, but it will be very slow.

For our tutorial we will use Cassandra as the datastore. To install Cassandra, you can follow the official tutorial at http://wiki.apache.org/cassandra/GettingStarted. We will install it via apt-get.

You will want to replace 21x by the series you want to use: 20x for the 2.0.x series, 12x for the 1.2.x series, etc… You will not automatically get major version updates unless you change the series, but that is a feature.
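The series naming can be sketched in a few lines of Python. This is only an illustration of how a version number maps to a series name; the repository URL below is an assumption, so take the real apt source line from the official guide.

```python
def cassandra_series(version: str) -> str:
    """Map a Cassandra version such as '2.1.4' to its apt series name ('21x')."""
    major, minor = version.split(".")[:2]
    return f"{int(major)}{int(minor)}x"


def apt_source_line(version: str,
                    repo: str = "http://www.apache.org/dist/cassandra/debian") -> str:
    # The default repo URL is an assumption, not taken from this post.
    return f"deb {repo} {cassandra_series(version)} main"


print(cassandra_series("2.0.9"))   # 20x
print(cassandra_series("1.2.19"))  # 12x
print(apt_source_line("2.1.4"))
```

Because the series only encodes the major and minor numbers, patch releases inside a series arrive with a normal upgrade, while moving to the next series requires editing the source line.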

We also need to add the public keys to be able to access the Debian packages.

Now we are ready to install it.

This will install the Cassandra database. A few things you should know: the configuration files are located in /etc/cassandra, and the start-up options (heap size, etc.) can be configured in /etc/default/cassandra. Now that Cassandra is installed, run it.

Another requirement is the Oracle Java JDK instead of OpenJDK. You must install version 7 or 8 (8 is recommended; I’m using 7). Again, we will install it with apt-get.

Source: http://stackoverflow.com/a/16263651/73010

KairosDB uses Thrift to communicate with Cassandra. When I installed Cassandra, Thrift wasn’t enabled by default, so I had to enable it first. There are several ways to do that. If you hate fiddling with config files, you can install OpsCenter. It’s a really great tool for monitoring your cluster, with a simple interface where you can access your nodes and change their configuration to enable Thrift. To change it in the config file instead, set start_rpc to true in /etc/cassandra/cassandra.yaml.
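If you prefer to script the config change, a minimal Python sketch could look like this. It works on the text of the file and assumes start_rpc appears at most once as a top-level setting; the sample text is made up for the example.

```python
import re

# Matches "start_rpc: <anything>", optionally commented out with '#'.
START_RPC = re.compile(r"^[ \t]*#?[ \t]*start_rpc[ \t]*:.*$", re.MULTILINE)


def enable_thrift(yaml_text: str) -> str:
    """Return the config text with start_rpc set to true."""
    if START_RPC.search(yaml_text):
        return START_RPC.sub("start_rpc: true", yaml_text, count=1)
    # Setting not present at all: append it.
    return yaml_text.rstrip("\n") + "\nstart_rpc: true\n"


sample = "cluster_name: 'Test Cluster'\nstart_rpc: false\nrpc_port: 9160\n"
print(enable_thrift(sample))
```

Cassandra reads cassandra.yaml at start-up, so restart the service after saving the change.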

Installing KairosDB

Again, there are a few ways to install KairosDB.

a) Building from the source

1. Clone the git repository https://github.com/kairosdb/kairosdb.git
2. Make sure that JAVA_HOME is set to your Java install.
3. Compile the code

b) Installing via .deb package (recommended)

At the time of the last update, the current stable version is 1.1.1 (it was 0.9.4 when this post was first written). Make sure you download the latest version from https://github.com/kairosdb/kairosdb/releases.

Setting Cassandra as a datastore

As mentioned before, KairosDB uses the H2 database as its datastore by default. We need to change it to Cassandra.

a) If you are running from source, copy kairosdb.properties from the src/main/resources/ folder to the KairosDB root folder and edit the copy.
b) If you installed the package, edit the file /opt/kairosdb/conf/kairosdb.properties.

In the file, comment out the line where H2 is set as the datastore and uncomment the line for the Cassandra module.
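The switch can be sketched in Python as a small text transformation. The property key and the two module class names below are written from memory of the default kairosdb.properties, so treat them as assumptions and check them against your own file.

```python
# Assumed names; verify against the kairosdb.properties shipped with your version.
KEY = "kairosdb.service.datastore"
H2 = "org.kairosdb.datastore.h2.H2Module"
CASSANDRA = "org.kairosdb.datastore.cassandra.CassandraModule"


def use_cassandra(properties_text: str) -> str:
    """Comment out the H2 datastore line and uncomment the Cassandra one."""
    out = []
    for line in properties_text.splitlines():
        setting = line.lstrip("# ").strip()  # the line without comment markers
        if setting == f"{KEY}={H2}":
            out.append(f"#{KEY}={H2}")
        elif setting == f"{KEY}={CASSANDRA}":
            out.append(f"{KEY}={CASSANDRA}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


before = f"{KEY}={H2}\n#{KEY}={CASSANDRA}\n"
print(use_cassandra(before))
```

All other lines are passed through untouched, so running it twice gives the same result as running it once.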

You can also change some other settings to tune it, but for now just save the file and you are ready to go.

Test if everything works

Make sure your Cassandra service is running. Now let’s run KairosDB.

a) Running from source

b) Or if installed

Go to http://localhost:8080 to check that everything works. If you can see the KairosDB dashboard, then congratulations, you can now use KairosDB.
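The same check can be scripted, which is handy on a server without a browser. This is a minimal sketch that only tests whether the address answers with HTTP 200; the function name and the injectable opener parameter are my own choices for the example.

```python
import urllib.error
import urllib.request


def kairosdb_is_up(url="http://localhost:8080",
                   opener=urllib.request.urlopen, timeout=5):
    """Return True if the KairosDB web interface answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return False
```

Calling kairosdb_is_up() with no arguments checks the default address used in this tutorial; pass another URL if KairosDB runs on a different host or port.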

What’s next

In the next tutorial we will see how to save, query and delete datapoints with the web interface and the HTTP and Telnet APIs.

Why I switched from OpenTSDB to KairosDB?

In my previous post, I described how to correctly install and use OpenTSDB. After some time, I decided to move on to another solution.

The story

First of all, we need to know one thing. Because of IoT, the demand for storing sensor data has increased dramatically. Many new projects have emerged; some are good, some are bad. They differ in the technologies they use, how fast they are and what kind of features they support.

You can read the full list of all IoT time series databases that can be used for storing the data of your Internet of Things project or startup.

Problems of OpenTSDB

OpenTSDB is great, don’t get me wrong. But when you try to use it for more complex projects and customer demands, you can quickly hit a wall. That’s mostly because it needs a lot of moving parts to work (Hadoop, HBase, ZooKeeper). If one of the parts fails, the whole thing fails. Sure, you can replicate each part and make it more robust, but you will also spend more money. When you are just starting, that’s over-optimization and a waste of money (that you don’t have).

Aggregation of the data is another problem. OpenTSDB does support basic functions like min, max, avg etc. I spent days investigating why avg aggregation was not working correctly when I filtered by multiple tags. It just didn’t want to work and I couldn’t find anything in the docs. I asked on the Google group and after some time I got a reply that I must use another aggregation function, and that even that one doesn’t do 100% what I want. Another problem is getting just one value, for example the avg of all values from X to now. Not possible!

The lack of client libraries for OpenTSDB is another problem for me. Sure, storing data through the socket API is super simple and can be easily integrated in every language. The HTTP API is another story. Again, it shouldn’t be a problem to implement my own client, but why waste time on this?

Development of OpenTSDB is slow and it takes ages for new features to be integrated. One of them (one of the most important for me) is support for time zones. It’s needed when downsampling data to one day (or longer) so that the data is grouped into the correct days. There was some work on it, but at the time of writing it still wasn’t implemented. Too bad.
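To see why the time zone matters for daily downsampling, here is a small Python sketch. It is not OpenTSDB or KairosDB code, just an illustration: the same two datapoints land in different daily buckets depending on the time zone used to cut the days. A fixed UTC+1 offset stands in for a real time zone, so daylight saving time is ignored.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def to_ms(moment):
    """Convert an aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def daily_sums(points, tz):
    """Sum values per calendar day as seen in the time zone tz."""
    buckets = defaultdict(float)
    for ts_ms, value in points:
        moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        buckets[moment.astimezone(tz).date().isoformat()] += value
    return dict(buckets)


points = [
    (to_ms(datetime(2015, 1, 1, 23, 0, tzinfo=timezone.utc)), 1.0),
    (to_ms(datetime(2015, 1, 2, 1, 0, tzinfo=timezone.utc)), 2.0),
]

print(daily_sums(points, timezone.utc))
# {'2015-01-01': 1.0, '2015-01-02': 2.0}
print(daily_sums(points, timezone(timedelta(hours=1))))
# {'2015-01-02': 3.0}
```

In UTC the two points belong to two different days, while for a user one hour east they both belong to January 2nd. A database that can only downsample in UTC gives that user the wrong daily totals.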

On the bright side, OpenTSDB is super fast. I was able to store and load data at a very high rate: loading 3 million records in a few seconds is super fast for me. Try that with a relational database and you will quickly be disappointed.

KairosDB to the rescue

I remember that when I was doing my research, I noticed KairosDB but didn’t spend much time testing it. It just wasn’t appealing and I didn’t know how it actually works. Big mistake.

KairosDB uses Cassandra to store data (compared to HBase used by OpenTSDB) and it’s actually a rewritten and upgraded version of OpenTSDB. It has evolved into a great project. It has many more features: many more (and fully working) aggregation methods, an option to easily delete a metric or datapoint, easy extensibility with plugins etc. It has great clients and a much more active community. I remember asking a question on the OpenTSDB Google group and waiting weeks for an answer (I don’t expect anyone to provide support; after all, it’s an open source project), while on the KairosDB Google group I got one within a day.

Why is this important, you might ask? Well, when you are chasing deadlines and something goes wrong, a responsive community is very important. Sometimes this kind of thing can be the difference between success and failure.

What now?

I wrote a tutorial on how to get started with KairosDB. You can also visit kairosdb.org and check out the documentation. Feel free to play with it, test it and hopefully also use it in production.

The complete list of all time series databases for your IoT project

While searching for the perfect database for my project, I spent hours and hours searching the internet and making a list of all the candidates. I quickly realized that the list is pretty long and that the projects differ in many ways, but all of them have the same goal: to store your time series data.

What the data looks like

Time series data always consists of at least two parts, which together we call a datapoint: a time and a value. At a certain time we have a certain value. Depending on the architecture of the particular time series database, we can also annotate the datapoint with additional information. The goal of this information is to better differentiate the data and make it easier to filter. One example is adding a source=device.1 tag to the datapoint. Later we can easily fetch all the data that belongs to device.1.
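As a rough illustration, a datapoint and a tag filter could be modelled like this in Python. The class and function names are made up for the example and do not come from any of the databases listed below.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    timestamp: int          # milliseconds since the Unix epoch
    value: float
    tags: dict = field(default_factory=dict)


def with_tag(points, key, value):
    """Return only the datapoints annotated with the tag key=value."""
    return [p for p in points if p.tags.get(key) == value]


points = [
    DataPoint(1420070400000, 21.5, {"source": "device.1"}),
    DataPoint(1420070400000, 19.0, {"source": "device.2"}),
    DataPoint(1420070460000, 21.7, {"source": "device.1"}),
]

print([p.value for p in with_tag(points, "source", "device.1")])
# [21.5, 21.7]
```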

So in a sense, time series databases are similar to key-value databases where the key is a combination of time and tags. The only difference is that we have a better ability to filter the data, and all the nice aggregation functions (min, max, avg, dev, ...) are already built in (well, in most cases). Just as in key-value databases everything starts with a key, in time series databases everything starts with time.
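The key-value analogy can be made concrete with a plain Python dictionary. This is a toy sketch, not how any real time series database stores data: the key is (time, tag), and the aggregation runs over a time range for one tag value.

```python
from statistics import mean, pstdev

# Key: (timestamp in ms, source tag). Value: the measurement.
store = {
    (1000, "device.1"): 2.0,
    (2000, "device.1"): 4.0,
    (3000, "device.1"): 9.0,
    (2000, "device.2"): 100.0,
}


def aggregate(store, source, start, end):
    """Aggregate all values of one source with start <= time <= end."""
    values = [v for (ts, src), v in store.items()
              if src == source and start <= ts <= end]
    if not values:
        return None  # nothing to aggregate in this range
    return {"min": min(values), "max": max(values),
            "avg": mean(values), "dev": pstdev(values)}


result = aggregate(store, "device.1", 1000, 3000)
print(result["min"], result["max"], result["avg"])  # 2.0 9.0 5.0
```

A real time series database does the same range selection and aggregation for you, without scanning every key.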

Below is the list of all the time series databases I found that follow the previously mentioned principle. If you find a new unlisted database, or you create a new one and want to share it, send me an email at erol@(enter my domain name).

The list

1. OpenTSDB
Website: http://www.opentsdb.net/
Pricing: Free
Technologies: Java, HBase
Presentation: http://youtu.be/WlsyqhrhRZA

Store and serve massive amounts of time series data without losing granularity.

2. KairosDB
Website: http://www.kairosdb.org
Pricing: Free
Technologies: Java, Cassandra
Clients: Java, Python
Presentation: http://youtu.be/Ykf_C9RZEQI?t=31m15s

KairosDB is a fast distributed scalable time series database written on top of Cassandra.

3. InfluxDB
Website: http://influxdb.com/
Pricing: Free
Technologies: Go, BoltDB
Clients: JavaScript, Ruby, Python, Node.js, PHP, Java, Clojure, Common Lisp, Go, Scala, R, Erlang, Perl, Haskell, .NET
Presentation: http://youtu.be/sRi64imN7xg

InfluxDB is a time series, metrics, and analytics database. It’s written in Go and has no external dependencies. That means once you install it there’s nothing else to manage (like Redis, ZooKeeper, HBase, or whatever).

4. TempoIQ
Website: https://www.tempoiq.com/
Pricing: Subscription
Clients: .NET, Java, Node.js, Python, Ruby
Presentation: http://youtu.be/TRv0tfFAdbY

Fast, scalable monitoring & analysis of sensor data in your application.

5. Graphite
Website: https://github.com/graphite-project

Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite’s processing backend, carbon, which stores the data in Graphite’s specialized database. The data can then be visualized through graphite’s web interfaces.

6. Druid
Website: http://druid.io/
Pricing: Free
Technologies: Java
Clients: Ruby, Python, R, Node.js
Presentation: http://youtu.be/Dlqj34l2upk

An open-source, real-time data store designed to power interactive applications at scale.

7. kdb+
Website: http://kx.com/
Technologies: K
Clients: Java, .NET, Python, Excel
Presentation: http://youtu.be/AGGGU7tVdEk

The high-performance database that sets the standard for time-series analytics.

8. RRDtool
Website: http://oss.oetiker.ch/rrdtool/

RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data. RRDtool can be easily integrated in shell scripts, perl, python, ruby, lua or tcl applications.

9. seriesly
Website: https://github.com/dustin/seriesly
Technologies: Go

seriesly is a database for storing and querying time series data. Unlike databases like RRDtool, it’s schemaless so you can just lob data into it and start hacking. However, it also doesn’t use a finite amount of space.

10. Cube
Website: http://square.github.io/cube/ (development seems stopped, most active fork is https://github.com/red-gate/cube)
Pricing: Free
Technologies: Node.js, MongoDB

Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets. Cube is built on MongoDB.

11. IBM Informix
Website: http://www-01.ibm.com/software/data/informix/

Informix, with its TimeSeries feature, helps organizations solve the Big Data challenge of sensor data by providing unprecedented performance and scalability to applications that leverage time series data.

12. Akumuli
Website: http://www.akumuli.org/
Pricing: Free
Technologies: C++

Distributed time-series database

13. BlueFlood
Website: http://blueflood.io/
Technologies: Java
Presentation: http://vimeo.com/87210602

Blueflood is a multi-tenant distributed metric processing system created by engineers at Rackspace. It is used in production by the Cloud Monitoring team to process metrics generated by their monitoring systems. Blueflood is capable of ingesting, rolling up and serving metrics at a massive scale.

14. DalmatinerDB
Website: https://dalmatiner.io/
Technologies: Erlang, ZFS, Riak Core

DalmatinerDB is a no fluff purpose built metric database. Not a layer put on top of a general purpose database or datastores.

15. Rhombus
Website: https://github.com/Pardot/Rhombus
Pricing: Free
Technologies: Java, Cassandra

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

16. Prometheus
Website: http://prometheus.io/
Pricing: Free
Technologies: Go
Clients: Go, Java, Ruby

An open-source service monitoring system and time series database.

17. Axibase Time-Series Database
Website: http://axibase.com/products/axibase-time-series-database/
Pricing: Free & License version
Technologies: Java, HBase, Hadoop
Clients: Java, R Language, PHP, Python, Ruby, JavaScript

Axibase Time-Series Database (ATSD) is a next-generation statistics database. ATSD is for companies that need to extract value from large amounts of time-series data which exists in their IT and operational infrastructure.

18. Newts
Website: http://opennms.github.io/newts/
Pricing: Free
Technologies: Java, Cassandra

A time-series data store based on Apache Cassandra.

19. InfiniFlux
Website: http://www.infiniflux.com/
Pricing: Free & License version
Technologies: C
Clients: Java, Python, JavaScript, R, PHP

INFINIFLUX is the World’s Fastest Time Series DBMS for IoT and BigData.

20. Heroic
Website: https://spotify.github.io/heroic/
Pricing: Free
Technologies: Java, Cassandra, Elasticsearch

The Heroic Time Series Database

21. Riak TS
Website: http://basho.com/products/riak-ts/
Pricing: Free & License version
Technologies: Erlang
Clients: Java, Ruby, Python, Erlang, Node.js
Presentation: https://www.youtube.com/watch?v=l-U-oSnpdLQ

Riak TS is the only enterprise-grade NoSQL database optimized for IoT and Time Series data

22. The Warp 10 Platform
Website: http://www.warp10.io/
Pricing: Free
Technologies: Java, WarpScript
Presentation: http://www.slideshare.net/Mathias-Herberts/warp-10-platform-presentation-criteo-beer-tech-20160203

Warp 10 is an Open Source Geo Time Series® Platform designed to handle data coming from sensors, monitoring systems and the Internet of Things.

23. KsanaDB
Website: https://github.com/zzzmanzzz/KsanaDB
Pricing: Free
Technologies: Go, Redis

KsanaDB is a time series database, base on redis and go.

24. eXtremeDB DBMS
Website: http://financial.mcobject.com/
Pricing: License version
Presentation: https://youtu.be/lG7Fw1sHFKQ

The eXtremeDB DBMS product family delivers high levels of scalability, reliability and processing speed for storing and manipulating complex data, and is used successfully in environments ranging from Big Data analytics in the data center to supporting increasingly “smart” features in the resource-constrained devices comprising the Internet of Things.

25. SiriDB
Website: http://siridb.net/
Pricing: Free
Technologies: C

SiriDB is a highly-scalable, robust and fast time series database. Build from the ground up SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB’s unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series.

26. DB4IoT
Website: http://db4iot.com/
Pricing: License version
Presentation: https://youtu.be/X-4OB_8SzTI

Visualize and analyze time-series IoT data with blazing-fast interactive maps for the “Internet of Moving Things.”

27. quasardb
Website: http://www.quasardb.net/
Pricing: Client API Open Source, Community Edition and Enterprise Edition
Technologies: C++14
Clients: Java, .NET, Python, Excel, C, C#, C++, R

Quasardb is a high-performance, distributed, column-oriented database with native time series support.

I will frequently update the list and add new time series databases as they come along.

pg_dump: permission denied for relation mytable – LOCK TABLE public.mytable IN ACCESS SHARE MODE

It is good practice to create backups of your database at regular intervals. If you are using a PostgreSQL database, you can use the built-in tool called pg_dump. With pg_dump we can export the database structure and data. If we want to dump all databases, we can use pg_dumpall.

When I was creating a simple bash script, I was getting a very strange error: pg_dump: permission denied for relation mytable – LOCK TABLE public.mytable IN ACCESS SHARE MODE. Googling around, I found a few tips on how to solve the problem, but no actual solution.

Script to dump

To make our life easier, we wrap the whole process in a script. It’s also convenient to have a script that we can later call from other processes, from build tools (for example, to back up before upgrading) or from cron.
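A minimal sketch of such a backup script, written here in Python rather than bash. The database name mydb and the user exportuser are placeholders for the example; only the pg_dump options -h, -U and -f are used.

```python
import subprocess
from datetime import date


def pg_dump_command(database, user, outfile, host="localhost"):
    """Build the pg_dump command line as an argument list."""
    return ["pg_dump", "-h", host, "-U", user, "-f", outfile, database]


def backup(database, user, directory="."):
    """Dump one database to <directory>/<database>-<today>.sql."""
    outfile = f"{directory}/{database}-{date.today().isoformat()}.sql"
    # check=True raises CalledProcessError if pg_dump exits with an error.
    subprocess.run(pg_dump_command(database, user, outfile), check=True)
    return outfile


# Placeholders: replace with your own database and export user.
print(" ".join(pg_dump_command("mydb", "exportuser", "mydb.sql")))
# pg_dump -h localhost -U exportuser -f mydb.sql mydb
```

Passing the arguments as a list avoids shell quoting problems with unusual database or file names.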

When we run this, we get the error mentioned above. Big problem.

Locked table problem

The problem is with permissions. There are multiple permission layers. The first is whether we actually have access to the database. The second is whether we actually have access to the table, in our case the table mytable. To check it, we need to look at the structure and permissions of the table.

Listing the tables (for example with \dt in psql) outputs every table in the database. If we check the columns, we will notice that there is an owner column. In our case it’s important that the table owner and the export user are the same, otherwise we get the permission problem.

To change the owner of the table, we need to run an ALTER TABLE ... OWNER TO command.

The command changes the table’s owner to our export user. Be sure to change the owner of every table in the selected database.
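With many tables it is easier to generate the statements than to type them. The following Python sketch only builds the SQL text; the table list and the user name exportuser are placeholders, and you would still run the output in psql as a user who is allowed to change ownership.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def owner_statements(tables, new_owner, schema="public"):
    """Build one ALTER TABLE ... OWNER TO statement per table."""
    return [
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"OWNER TO {quote_ident(new_owner)};"
        for table in tables
    ]


for statement in owner_statements(["mytable", "othertable"], "exportuser"):
    print(statement)
# ALTER TABLE "public"."mytable" OWNER TO "exportuser";
# ALTER TABLE "public"."othertable" OWNER TO "exportuser";
```

Note that quoted identifiers are case-sensitive in PostgreSQL, so the names must be spelled exactly as they are stored.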

Extra tip – cron

Of course we don’t want to waste time on tasks that can be automated. One of them is running our dbbackup.sh every week, every month or at whatever interval you desire. To perform backups every week, we can use cron.

To add a cron job, just run crontab -e.

It opens a simple editor where you write your tasks/jobs. In our case, we will run the backup every Sunday at midnight (00:00).
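A crontab line has five time fields (minute, hour, day of month, month, day of week) followed by the command, and in the day-of-week field 0 means Sunday. The Python sketch below builds the line for our schedule and checks when it would fire. The script path is a placeholder, and the matcher is simplified: it only understands single numbers and *.

```python
from datetime import datetime

FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def cron_line(schedule, command):
    """Build a crontab line; fields missing from schedule become '*'."""
    times = " ".join(str(schedule.get(name, "*")) for name in FIELDS)
    return f"{times} {command}"


def fires_at(schedule, moment):
    """Simplified check: does the schedule match this minute?"""
    actual = {
        "minute": moment.minute,
        "hour": moment.hour,
        "day_of_month": moment.day,
        "month": moment.month,
        "day_of_week": moment.isoweekday() % 7,  # cron: 0 = Sunday
    }
    return all(schedule.get(name, "*") in ("*", actual[name]) for name in FIELDS)


weekly = {"minute": 0, "hour": 0, "day_of_week": 0}
print(cron_line(weekly, "/path/to/dbbackup.sh"))
# 0 0 * * 0 /path/to/dbbackup.sh
print(fires_at(weekly, datetime(2015, 1, 4, 0, 0)))  # a Sunday: True
print(fires_at(weekly, datetime(2015, 1, 5, 0, 0)))  # a Monday: False
```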

To make our script work with cron, we need one extra thing. When we run the script by hand, we are asked for the password. Cron cannot enter a password, so the job would fail. The suggested fix is to create a ~/.pgpass file and add a line with the connection details and the password to it.
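Each line in ~/.pgpass has the form hostname:port:database:username:password, and PostgreSQL ignores the file unless its permissions are 0600. A minimal Python sketch for creating such an entry follows; all connection values are placeholders.

```python
import os


def pgpass_line(host, port, database, user, password):
    """Format one .pgpass entry, escaping ':' and '\\' inside the fields."""
    def escape(field):
        return str(field).replace("\\", "\\\\").replace(":", "\\:")
    return ":".join(escape(f) for f in (host, port, database, user, password))


def append_pgpass(path, line):
    """Append an entry and make sure only the owner can read the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    with os.fdopen(fd, "a") as handle:
        handle.write(line + "\n")
    os.chmod(path, 0o600)  # tighten permissions if the file already existed


# Placeholders: use your own host, database, user and password.
print(pgpass_line("localhost", 5432, "mydb", "exportuser", "secret"))
# localhost:5432:mydb:exportuser:secret
```

To write the entry, call append_pgpass(os.path.expanduser("~/.pgpass"), line) as the same system user that runs the cron job, because the file is looked up in that user’s home directory.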

Now when cron runs the script, everything will work.