How to install KairosDB time series database?

In my previous post, I described why I switched from OpenTSDB (another time series database) to KairosDB. In this post, I will show how to install and run KairosDB.

Requirements

To run KairosDB we actually just need KairosDB (if we ignore Ubuntu/Debian or something similar, and Java). How is that possible? Well, KairosDB supports two datastores: H2 and Cassandra. H2 is actually an in-memory database. It’s easy to set up and clean up, and it’s mostly used for development. Don’t use it in production; it will work, but it will be very, very slow.

For our tutorial we will use Cassandra as the datastore. To install Cassandra, you can follow the official tutorial at http://wiki.apache.org/cassandra/GettingStarted. We will install it via apt-get.

You will want to replace 21x by the series you want to use: 20x for the 2.0.x series, 12x for the 1.2.x series, etc… You will not automatically get major version updates unless you change the series, but that is a feature.

We also need to add the public keys to be able to access the Debian packages.

Now we are ready to install it.

This will install the Cassandra database. A few things you should know: the configuration files are located in /etc/cassandra, and the start-up options (heap size, etc.) can be configured in /etc/default/cassandra. Now that Cassandra is installed, run it.

Another requirement is that you have Oracle Java JDK instead of OpenJDK. You must install version 7 or 8 (8 is recommended, I’m using 7). Again, we will install it with apt-get.

Source: http://stackoverflow.com/a/16263651/73010

KairosDB uses Thrift to communicate with Cassandra. When I installed Cassandra, Thrift wasn’t enabled by default, so I had to enable it first. There are many ways to do that, and if you hate fiddling with config files, you can install OpsCenter. It’s a really great tool for monitoring your cluster. It has a simple interface where you can access your nodes and change their configuration to enable Thrift. To change it in the config file instead, set start_rpc to true in /etc/cassandra/cassandra.yaml.
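If you prefer to script the config change, a minimal Python sketch could look like the one below. It only rewrites the value of start_rpc in a piece of YAML text; the sample text and the function name enable_thrift are made up for illustration, and writing the result back to /etc/cassandra/cassandra.yaml (and restarting Cassandra) is left to you.

```python
import re


def enable_thrift(yaml_text: str) -> str:
    """Return the given cassandra.yaml text with start_rpc set to true."""
    # Only touch an uncommented start_rpc key at the start of a line.
    return re.sub(r"(?m)^start_rpc:\s*\S+", "start_rpc: true", yaml_text)


# Hypothetical excerpt of a cassandra.yaml file, used only for this demo.
sample = "cluster_name: 'Test Cluster'\nstart_rpc: false\nrpc_port: 9160\n"
print(enable_thrift(sample))
```

The regular expression deliberately ignores commented-out lines, so a "# start_rpc: false" comment stays as it is.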

Installing KairosDB

Again, we can install KairosDB in a few ways.

a) Building from the source

1. Clone the git repository https://github.com/kairosdb/kairosdb.git
2. Make sure that JAVA_HOME is set to your Java installation.
3. Compile the code

b) Installing via .deb package (recommended)

The current stable version is 1.1.1. Make sure you download the latest version at https://github.com/kairosdb/kairosdb/releases.

Setting Cassandra as a datastore

As mentioned before, KairosDB uses the H2 database as its datastore by default. We need to change it to Cassandra.

a) If you are running from source, copy kairosdb.properties from the src/main/resources/ folder to the KairosDB root folder and edit the copy.
b) If you installed the package, edit the file /opt/kairosdb/conf/kairosdb.properties.

In the file, comment out the line where H2 is set as the datastore and uncomment the Cassandra module. So the file should look like this.
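The switch can also be sketched in Python. The property key and the two module class names below are assumptions based on the sample kairosdb.properties that ships with KairosDB, so check them against your own file before relying on this. The function use_cassandra is a hypothetical helper that works on the text of the file, not on the file itself.

```python
# Assumed names, taken from the sample config shipped with KairosDB.
KEY = "kairosdb.service.datastore"
H2_MODULE = "org.kairosdb.datastore.h2.H2Module"
CASSANDRA_MODULE = "org.kairosdb.datastore.cassandra.CassandraModule"


def use_cassandra(properties_text: str) -> str:
    """Comment out the H2 datastore line and uncomment the Cassandra one."""
    out = []
    for line in properties_text.splitlines():
        # Compare the line without any leading '#' or spaces.
        bare = line.lstrip("# ").strip()
        if bare == f"{KEY}={H2_MODULE}":
            out.append(f"#{KEY}={H2_MODULE}")
        elif bare == f"{KEY}={CASSANDRA_MODULE}":
            out.append(f"{KEY}={CASSANDRA_MODULE}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


default_config = (
    f"{KEY}={H2_MODULE}\n"
    f"#{KEY}={CASSANDRA_MODULE}\n"
)
print(use_cassandra(default_config))
```

All other lines pass through untouched, so the rest of your settings survive the rewrite.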

You can also change other settings to tune it, but for now just save the file and you are ready to go.

Test if everything works

Make sure your Cassandra service is running. Now let’s run KairosDB.

a) Running from source

b) Or if installed

Go to http://localhost:8080 to check if everything works OK. If you can see the KairosDB dashboard, then congratulations, you can now use KairosDB.

What’s next

In the next tutorial we will see how to save, query and delete datapoints with the web interface and the HTTP and Telnet APIs.

Why I switched from OpenTSDB to KairosDB?

In my previous post, I described how to correctly install and use OpenTSDB. After some time, I decided to move on to another solution.

The story

Before anything else, we need to know one thing. Because of IoT, the demand for storing sensor data has increased dramatically. Many new projects have emerged; some are good, some are bad. They differ in the technologies they use, how fast they are and what kind of features they support.

You can read the full list of all IoT time series databases that can be used for storing the data of your Internet of Things project or startup.

Problems of OpenTSDB

OpenTSDB is great, don’t get me wrong. But when you try to use it with more complex projects and customer demands, you can quickly hit a wall. That’s mostly because it needs a lot of moving parts to work (Hadoop, HBase, ZooKeeper). If one of the parts fails, the whole thing fails. Sure, you can replicate each part and make it more robust, but you will also spend more money. When you are starting out, that’s over-optimization and a waste of money (that you don’t have).

Aggregation of the data is another problem. It does support basic functions like min, max, avg etc. I spent days investigating why avg aggregation was not working correctly when I filtered by multiple tags. It just didn’t want to work and I couldn’t find anything in the docs. I asked on the Google group and after some time I got a reply that I must use another aggregation function, and that even that doesn’t work 100% as I want. Another problem is getting just one value, for example the avg of all values from X to now. Not possible!

The lack of client libraries for OpenTSDB is another problem for me. Sure, storing the data with the socket API is super simple and can be easily integrated in every language. The HTTP API is another story. Sure, again it shouldn’t be a problem to implement my own client, but why waste time on this?

Development of OpenTSDB is slow and it takes ages for new features to be integrated. One of them (one of the most important for me) is support for time zones. It’s needed when downsampling data to one day (or even more) so that the data is grouped correctly. There was some work on it, but to this day it still hasn’t been implemented. Too bad.

On the bright side, OpenTSDB is super fast. I was able to store and load data at a super fast rate; loading 3 million records in a few seconds is super fast for me. Try it with a relational database and you will be quickly disappointed.

KairosDB to the rescue

I remember that when I was doing my research, I noticed KairosDB but didn’t spend much time testing it. It just wasn’t appealing and I didn’t know how it actually works. Big mistake.

KairosDB uses Cassandra to store data (compared to HBase used with OpenTSDB) and it’s actually a rewritten and upgraded version of OpenTSDB. It has evolved into a great project. It has many more features: many more (and fully working) aggregation methods, an option to easily delete a metric or datapoint, easy extensibility with plugins etc. It has great clients and a much more active community. I remember when I asked a question on the OpenTSDB Google group and waited weeks for an answer (I’m not forcing anyone to provide support, because after all, it’s an open source project), while on the KairosDB Google group I got it within a day.

Why is this important, you might ask? Well, when you are chasing deadlines and something goes wrong, a responsive community is very important. Sometimes this kind of thing can be the difference between success and failure.

What now?

I wrote a tutorial on how to start with KairosDB. You can also visit kairosdb.org and check out the documentation. Feel free to play with it, test it and hopefully also use it in production.

The complete list of all time series databases for your IoT project

While searching for the perfect database for my project, I spent hours and hours searching the internet and making a list of all candidates. I quickly realized that the list is pretty long and the projects differ in many ways, but all of them have the same goal: store your time series data.

What the data looks like

The structure of time series data always consists of at least 2 parts (we call it a datapoint): time and value. At certain time we have a certain value. Based on architecture of different time series databases, we also annotate the datapoint with additional information. The goal of the information is to better differentiate the data and filter it easier. One of the examples is adding source=device.1 tag to the datapoint. Later we can easily fetch all the data that belongs to device.1.
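To make the structure concrete, here is a small Python sketch of a datapoint with a time, a value and tags, plus a filter by tag. The class name DataPoint, the helper filter_by_tag and the sample values are all made up for illustration; a real time series database does the filtering for you.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    timestamp: int                      # unix time in seconds
    value: float                        # the measured value
    tags: dict = field(default_factory=dict)  # extra info, e.g. the source


def filter_by_tag(points, key, wanted):
    """Return only the datapoints whose tag `key` equals `wanted`."""
    return [p for p in points if p.tags.get(key) == wanted]


points = [
    DataPoint(1420070400, 21.5, {"source": "device.1"}),
    DataPoint(1420070400, 19.0, {"source": "device.2"}),
    DataPoint(1420070405, 21.7, {"source": "device.1"}),
]

device_1 = filter_by_tag(points, "source", "device.1")
print([p.value for p in device_1])  # [21.5, 21.7]
```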

So in some sense, time series databases are similar to key-value databases where the key is a combination of time and tags. The only difference is that we have a better ability to filter the data, and all the nice aggregation functions (min, max, avg, dev, ..) are already built in (well, in most cases). Just as everything starts with a key in key-value databases, in time series databases everything starts with time.

Below is the list of all the time series databases I found that follow the previously mentioned principle. If you find a new unlisted database or you create a new one and want to share, send me an email at erol@(enter my domain name).

The list

1. OpenTSDB
Website: http://www.opentsdb.net/
Pricing: Free
Technologies: Java, HBase
Presentation: http://youtu.be/WlsyqhrhRZA

Store and serve massive amounts of time series data without losing granularity.

2. KairosDB
Website: http://www.kairosdb.org
Pricing: Free
Technologies: Java, Cassandra
Clients: Java, Python
Presentation: http://youtu.be/Ykf_C9RZEQI?t=31m15s

KairosDB is a fast distributed scalable time series database written on top of Cassandra.

3. InfluxDB
Website: http://influxdb.com/
Pricing: Free
Technologies: Go, BoltDB
Clients: JavaScript, Ruby, Python, Node.js, PHP, Java, Clojure, Common Lisp, Go, Scala, R, Erlang, Perl, Haskell, .NET
Presentation: http://youtu.be/sRi64imN7xg

InfluxDB is a time series, metrics, and analytics database. It’s written in Go and has no external dependencies. That means once you install it there’s nothing else to manage (like Redis, ZooKeeper, HBase, or whatever).

4. TempoIQ
Website: https://www.tempoiq.com/
Pricing: Subscription
Clients: .NET, Java, Node.js, Python, Ruby
Presentation: http://youtu.be/TRv0tfFAdbY

Fast, scalable monitoring & analysis of sensor data in your application.

5. Graphite
Website: https://github.com/graphite-project

Graphite is a highly scalable real-time graphing system. As a user, you write an application that collects numeric time-series data that you are interested in graphing, and send it to Graphite’s processing backend, carbon, which stores the data in Graphite’s specialized database. The data can then be visualized through graphite’s web interfaces.

6. Druid
Website: http://druid.io/
Pricing: Free
Technologies: Java
Clients: Ruby, Python, R, Node.js
Presentation: http://youtu.be/Dlqj34l2upk

An open-source, real-time data store designed to power interactive applications at scale.

7. kdb+
Website: http://kx.com/
Technologies: K
Clients: Java, .NET, Python, Excel
Presentation: http://youtu.be/AGGGU7tVdEk

The high-performance database that sets the standard for time-series analytics.

8. RRDtool
Website: http://oss.oetiker.ch/rrdtool/

RRDtool is the OpenSource industry standard, high performance data logging and graphing system for time series data. RRDtool can be easily integrated in shell scripts, perl, python, ruby, lua or tcl applications.

9. seriesly
Website: https://github.com/dustin/seriesly
Technologies: Go

seriesly is a database for storing and querying time series data. Unlike databases like RRDtool, it’s schemaless so you can just lob data into it and start hacking. However, it also doesn’t use a finite amount of space.

10. Cube
Website: http://square.github.io/cube/ (development seems stopped, most active fork is https://github.com/red-gate/cube)
Pricing: Free
Technologies: Node.js, MongoDB

Cube is a system for collecting timestamped events and deriving metrics. By collecting events rather than metrics, Cube lets you compute aggregate statistics post hoc. It also enables richer analysis, such as quantiles and histograms of arbitrary event sets. Cube is built on MongoDB.

11. IBM Informix
Website: http://www-01.ibm.com/software/data/informix/

Informix, with its TimeSeries feature, helps organizations solve the Big Data challenge of sensor data by providing unprecedented performance and scalability to applications that leverage time series data.

12. Akumuli
Website: http://www.akumuli.org/
Pricing: Free
Technologies: C++

Distributed time-series database

13. BlueFlood
Website: http://blueflood.io/
Technologies: Java
Presentation: http://vimeo.com/87210602

Blueflood is a multi-tenant distributed metric processing system created by engineers at Rackspace. It is used in production by the Cloud Monitoring team to process metrics generated by their monitoring systems. Blueflood is capable of ingesting, rolling up and serving metrics at a massive scale.

14. DalmatinerDB
Website: https://dalmatiner.io/
Technologies: Erlang, ZFS, Riak Core

DalmatinerDB is a no fluff purpose built metric database. Not a layer put on top of a general purpose database or datastores.

15. Rhombus
Website: https://github.com/Pardot/Rhombus
Pricing: Free
Technologies: Java, Cassandra

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

16. Prometheus
Website: http://prometheus.io/
Pricing: Free
Technologies: Go
Clients: Go, Java, Ruby

An open-source service monitoring system and time series database.

17. Axibase Time-Series Database
Website: http://axibase.com/products/axibase-time-series-database/
Pricing: Free & License version
Technologies: Java, HBase, Hadoop
Clients: Java, R Language, PHP, Python, Ruby, JavaScript

Axibase Time-Series Database (ATSD) is a next-generation statistics database. ATSD is for companies that need to extract value from large amounts of time-series data which exists in their IT and operational infrastructure.

18. Newts
Website: http://opennms.github.io/newts/
Pricing: Free
Technologies: Java, Cassandra

A time-series data store based on Apache Cassandra.

19. InfiniFlux
Website: http://www.infiniflux.com/
Pricing: Free & License version
Technologies: C
Clients: Java, Python, JavaScript, R, PHP

INFINIFLUX is the World’s Fastest Time Series DBMS for IoT and BigData.

20. Heroic
Website: https://spotify.github.io/heroic/
Pricing: Free
Technologies: Java, Cassandra, Elasticsearch

The Heroic Time Series Database

21. Riak TS
Website: http://basho.com/products/riak-ts/
Pricing: Free & License version
Technologies: Erlang
Clients: Java, Ruby, Python, Erlang, Node.js
Presentation: https://www.youtube.com/watch?v=l-U-oSnpdLQ

Riak TS is the only enterprise-grade NoSQL database optimized for IoT and Time Series data

22. The Warp 10 Platform
Website: http://www.warp10.io/
Pricing: Free
Technologies: Java, WarpScript
Presentation: http://www.slideshare.net/Mathias-Herberts/warp-10-platform-presentation-criteo-beer-tech-20160203

Warp 10 is an Open Source Geo Time Series® Platform designed to handle data coming from sensors, monitoring systems and the Internet of Things.

23. KsanaDB
Website: https://github.com/zzzmanzzz/KsanaDB
Pricing: Free
Technologies: Go, Redis

KsanaDB is a time series database, base on redis and go.

24. eXtremeDB DBMS
Website: http://financial.mcobject.com/
Pricing: License version
Presentation: https://youtu.be/lG7Fw1sHFKQ

The eXtremeDB DBMS product family delivers high levels of scalability, reliability and processing speed for storing and manipulating complex data, and is used successfully in environments ranging from Big Data analytics in the data center to supporting increasingly “smart” features in the resource-constrained devices comprising the Internet of Things.

25. SiriDB
Website: http://siridb.net/
Pricing: Free
Technologies: C

SiriDB is a highly-scalable, robust and fast time series database. Build from the ground up SiriDB uses a unique mechanism to operate without a global index and allows server resources to be added on the fly. SiriDB’s unique query language includes dynamic grouping of time series for easy analysis over large amounts of time series.

26. DB4IoT
Website: http://db4iot.com/
Pricing: License version
Presentation: https://youtu.be/X-4OB_8SzTI

Visualize and analyze time-series IoT data with blazing-fast interactive maps for the “Internet of Moving Things.”

27. quasardb
Website: http://www.quasardb.net/
Pricing: Client API Open Source, Community Edition and Enterprise Edition
Technologies: C++14
Clients: Java, .NET, Python, Excel, C, C#, C++, R

Quasardb is a high-performance, distributed, column-oriented database with native time series support.

I will frequently update the list and add new time series databases as they come along.

Best Internet of Things books

First we need to learn a few things about IoT. Although the concept wasn’t named until 1999, the Internet of Things has been in development for decades. The first Internet appliance, for example, was a Coke machine at Carnegie Mellon University in the early 1980s. The programmers could connect to the machine over the Internet, check the status of the machine and determine whether or not there would be a cold drink awaiting them, should they decide to make the trip down to the machine.

Kevin Ashton, cofounder and executive director of the Auto-ID Center at MIT, first mentioned the Internet of Things in a presentation he made to Procter & Gamble. From that day, IoT started to slowly grow and today it’s one of the fastest growing trends.

10 best Internet of Things books

One of the best ways to learn about IoT is through books. I have compiled a list of some of the best must-reads.

  1. Internet of Things (A Hands-on-Approach) (2014)
    Internet of Things (IoT) refers to physical and virtual objects that have unique identities and are connected to the internet to facilitate intelligent applications that make energy, logistics, industrial control, retail, agriculture and many other domains “smarter”.
  2. Designing the Internet of Things (2013)
    Whether it’s called physical computing, ubiquitous computing, or the Internet of Things, it’s a hot topic in technology: how to channel your inner Steve Jobs and successfully combine hardware, embedded software, web services, electronics, and cool design to create cutting-edge devices that are fun, interactive, and practical. If you’d like to create the next must-have product, this unique book is the perfect place to start.
  3. The Internet of Things (2014)
    You might only have heard this expression recently (indeed you might never have heard of it) but, apparently this is a concept that has been around for some time. The term was coined around the turn of the millennium and refers to the potential interconnectivity of basically all electronic devices and capacity to record, monitor and transmit information between them to achieve all manner of wonderful (and maybe not-so-wonderful) outcomes.
  4. The Silent Intelligence: The Internet of Things (2013)
    The Silent Intelligence is a book about the Internet of Things. We talk about the history, trends, technology ecosystem and future of Connected Cities, Connected Homes, Connected Health and Connected Cars. We also discuss the most exciting growth areas for entrepreneurs and venture capital investors.
  5. Rethinking the Internet of Things: A Scalable Approach to Connecting Everything (2013)
    Over the next decade, most devices connected to the Internet will not be used by people in the familiar way that personal computers, tablets and smart phones are. Billions of interconnected devices will be monitoring the environment, transportation systems, factories, farms, forests, utilities, soil and weather conditions, oceans and resources.
  6. The Epic Struggle of the Internet of Things (2014)
    If the hype is to be believed then the next big thing is the Internet of Things. But is it what you think it is? Because the Internet of Things is not about things on the internet. A world in which all our household gadgets can communicate with each other may sound vaguely useful, but it’s not really for us consumers.
  7. Designing for Emerging Technologies: UX for Genomics, Robotics, and the Internet of Things (2014)
    The recent digital and mobile revolutions are a minor blip compared to the next wave of technological change, as everything from robot swarms to skin-top embeddable computers and bio printable organs start appearing in coming years. In this collection of inspiring essays, designers, engineers, and researchers discuss their approaches to experience design for groundbreaking technologies.
  8. From Machine-to-Machine to the Internet of Things: Introduction to a New Age of Intelligence (2014)
    This book outlines the background and overall vision for the Internet of Things (IoT) and Machine-to-Machine (M2M) communications and services, including major standards. Key technologies are described, and include everything from physical instrumentation of devices to the cloud infrastructures used to collect data.
  9. Enchanted Objects: Design, Human Desire, and the Internet of Things (2014)
    In the tradition of Who Owns the Future? and The Second Machine Age, an MIT Media Lab scientist imagines how everyday objects can intuit our needs and improve our lives. We are now standing at the precipice of the next transformative development: the Internet of Things.
  10. Invest, Make Money & Retire Early From The Internet Of Things Revolution (2014)
    The internet of things will equate to one of the single most lucrative investment opportunities in the history of modern business and will be bigger than the economies of China, Norway, and Canada, combined. Simply put the internet of things is connecting any device with an on and off switch to the Internet (and/or to each other).

OpenTSDB – the perfect database for your Internet of Things projects

I got a call the other day with a question: how can we store a huge amount of sensor data? They are measuring air temperature in different rooms every 5 seconds. That means 17280 data points per day, 6307200 data points per year and, for 15 rooms, 94608000 data points per year.
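The numbers are easy to double-check. A quick Python sketch, assuming a 365-day year and one reading every 5 seconds per room:

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400
INTERVAL_SECONDS = 5             # one reading every 5 seconds
DAYS_PER_YEAR = 365              # leap years ignored
ROOMS = 15

per_day = SECONDS_PER_DAY // INTERVAL_SECONDS
per_year = per_day * DAYS_PER_YEAR
all_rooms_per_year = per_year * ROOMS

print(per_day, per_year, all_rooms_per_year)  # 17280 6307200 94608000
```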

Because I never had a situation where I needed to store a huge amount of sensor data, I didn’t know the answer. But I started digging. There are many questions online regarding what database to use to store this kind of data. Some recommend old-school databases like MySQL or Oracle. Some recommend Redis, Riak or MongoDB. But one recommendation beat them all: OpenTSDB.

OpenTSDB – The Scalable Time Series Database
Store and serve massive amounts of time series data without losing granularity.

Currently in version 2.0, OpenTSDB is a tested solution built on top of HBase. It was designed especially for time series data and can handle

– up to 100+ billion data points and
– 2000 new data points per second (tested by OpenTSDB developers on an old dual-core Intel Xeon CPU from 2006; I tested on a newer machine and could easily insert 20000 points in a few seconds).

Long story short: it’s a perfect database for a huge amount of sensor data. It has great options to query data (I will explain them below), has additional features to annotate data and it’s under active development.

Installation and running it for the first time

To run OpenTSDB, you first need to install HBase. The procedure is pretty straightforward: download HBase, unpack it, define the configuration and run it with

If everything was defined correctly, you should get a message

The next step is installing OpenTSDB. There is a great tutorial on how to install OpenTSDB. In short, download and unpack it or clone the git repository, and run the build.

It should take a few minutes to compile everything. The next step is to create the tables with the command

You can see the created tables with one of the few open source HBase viewers, like hrider. Currently the compression is set to none. It’s highly recommended to use LZO compression, because there is no performance impact but it can greatly reduce the size of your data.

Because we will store temperatures in a metric named temperatures, we need to create it first. OpenTSDB has a configuration option to enable auto creation of metrics, but it’s not recommended, so we will do it manually.

The last step is to run everything.

If everything went well, you should see OpenTSDB page at localhost:4242. It’s that simple.

How data is stored

How OpenTSDB stores the data is, in my opinion, the biggest difference compared to other databases. It does support tables, but they are actually called metrics. In each metric we can store data points. Each data point is structured as

Timestamp (unix time or ISO 8601 format) is the time of the data point. Value is a number (integer or float). Then we have tags. With tags we separate data points. In our example, we are storing a value for the bedroom on our first floor. This structure enables us to separate data and later make advanced queries; for example, the average temperature on the first floor or the sum of all rooms.

Storing data

With version 2.0, OpenTSDB has 2 ways to store and access data (plus one additional to store by importing data). They are Telnet API, HTTP API and batch import from a file. Make sure you have OpenTSDB running before you try the examples below.

Storing with Telnet API [Java]

We need to execute the put command with the metric and the data.
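Here is a minimal sketch of that, written in Python rather than Java. It formats a data point as a Telnet-style put line (put, metric, timestamp, value, then tag=value pairs) and has a small helper to send lines over a socket. The tag name floor, the sample timestamp and both function names are assumptions for illustration; only format_put is run in the example, since send_lines needs a running OpenTSDB on localhost:4242.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one Telnet-style put line, ending with a newline."""
    tag_part = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {timestamp} {value} {tag_part}\n"


def send_lines(lines, host="localhost", port=4242):
    """Send already formatted put lines to OpenTSDB over a plain socket."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall("".join(lines).encode("ascii"))


line = format_put("temperatures", 1420070400, 21.5,
                  {"room": "bedroom", "floor": "1"})
print(line, end="")  # put temperatures 1420070400 21.5 floor=1 room=bedroom
```

To store many points, build a list of lines and pass it to send_lines in one call, so all of them go through a single connection.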

Storing with HTTP API
When working with the HTTP API, we have to make a POST request to the URL localhost:4242/api/put with JSON data.

There is also a possibility to make a batch insert. Just wrap all metrics in an array.
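As a rough sketch in Python: a single data point is sent as one JSON object with metric, timestamp, value and tags, and a batch is a JSON array of such objects. This sketch sends the body with POST, which is the verb the OpenTSDB documentation lists for /api/put. The helper names, the kitchen room and the sample values are made up; post_points is defined but not called here because it needs a running OpenTSDB.

```python
import json
import urllib.request


def build_put_body(points):
    """Encode one point as a JSON object, several points as a JSON array."""
    payload = points[0] if len(points) == 1 else points
    return json.dumps(payload).encode("utf-8")


def post_points(points, url="http://localhost:4242/api/put"):
    """POST the points to OpenTSDB and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_put_body(points),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


bedroom = {"metric": "temperatures", "timestamp": 1420070400,
           "value": 21.5, "tags": {"room": "bedroom"}}
kitchen = {"metric": "temperatures", "timestamp": 1420070400,
           "value": 22.0, "tags": {"room": "kitchen"}}

print(build_put_body([bedroom]).decode("utf-8"))
print(build_put_body([bedroom, kitchen]).decode("utf-8"))
```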

Personally, I had a few problems inserting a large amount of data with the HTTP API. I ended up using the Telnet API and it seems to work really well.

Querying the data

The whole beauty of OpenTSDB is its ability not only to store a huge amount of data, but also to query it fast. I will be showing how to query data with the HTTP API, but the same query parameters can be used with the Telnet API.

For the examples, we will first insert some data. Of course we can insert a much larger dataset, but for this tutorial let’s keep it simple.

Getting all temperatures

Let’s break down the request:
1. We can make GET or POST requests
2. The HTTP API URL is http://localhost:4242/api/query
3. We must define start, but end is optional. Each can be a unix timestamp or a relative time in the form nx-ago, where n is the amount and x is the unit. For example, 1d-ago or 1h-ago. OpenTSDB will automatically convert it to a timestamp based on your current time.
4. m is the metric query, where we are using aggregation = sum and metric = temperatures.
5. The last is grouping_operator (inside {}), which is used to group the data by tag. If we define it with *, the data will not be merged into a single series; we get one result per tag value. We can also use it to filter. For example, room=bedroom will only fetch data from the bedroom.
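The pieces above can be put together with a few lines of Python. This is only a sketch of how the query string is assembled; build_query_url is a hypothetical helper, and the special characters in the m parameter get percent-encoded by urlencode.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(start, aggregator, metric, tags=None, end=None,
                    base="http://localhost:4242/api/query"):
    """Assemble an /api/query URL from start, aggregator, metric and tags."""
    m = f"{aggregator}:{metric}"
    if tags:
        # Tags go inside {} as comma separated key=value pairs.
        tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        m += "{" + tag_part + "}"
    params = {"start": start, "m": m}
    if end is not None:
        params["end"] = end
    return f"{base}?{urlencode(params)}"


url = build_query_url("1h-ago", "sum", "temperatures", {"room": "bedroom"})
print(url)
# Decoding the query string gives back the readable m expression.
print(parse_qs(urlsplit(url).query)["m"][0])  # sum:temperatures{room=bedroom}
```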

You can read more about different parameters and what they do at http://opentsdb.net/docs/build/html/api_http/query/index.html.

Our above request returns JSON.

Getting temperatures in the bedroom

As mentioned above, we can query by tags. In our case by room=bedroom.

returns

Getting average temperature on first floor

To calculate the average of the temperatures on the first floor, we have to group by tags. Be careful to define the correct aggregation function (in our case avg). See all aggregators at http://opentsdb.net/docs/build/html/user_guide/query/aggregators.html#available-aggregators.

produces

We can see tag room in aggregateTags. It means it used this tag to aggregate (or if you are familiar with other databases, GROUP BY) data.
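To show the idea behind aggregating over the room tag, here is a plain Python sketch that averages values sharing a timestamp across several rooms. It is not how OpenTSDB implements it (OpenTSDB also interpolates when timestamps do not line up, which this sketch ignores), and the rooms and values are invented.

```python
from collections import defaultdict
from statistics import mean


def average_across_series(series_by_room):
    """Average the values that share a timestamp across all rooms."""
    buckets = defaultdict(list)
    for points in series_by_room.values():
        for timestamp, value in points.items():
            buckets[timestamp].append(value)
    return {ts: mean(values) for ts, values in sorted(buckets.items())}


first_floor = {
    "bedroom": {1420070400: 20.0, 1420070405: 21.0},
    "kitchen": {1420070400: 22.0, 1420070405: 23.0},
}

print(average_across_series(first_floor))
# {1420070400: 21.0, 1420070405: 22.0}
```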

Getting average temperatures per day

Let’s imagine a situation where we want to create reports of the temperatures on a daily basis. We could load all the data and then manually calculate the averages. For larger datasets it could take some time. OpenTSDB has an answer. We can also define downsampling. Downsampling will automatically calculate the values based on our downsampling aggregation function and timeframe.

Notice the different m parameter? We added 1d-avg (be careful to separate everything correctly with “:”), which will downsample by 1 day and calculate the average. Compared to the manual way, it’s much faster and it just gives us results, which we can use in graphs.
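For comparison, this is roughly what the manual way looks like in Python: put every point into the bucket of its day and average each bucket. The sketch assumes day boundaries in UTC and uses invented values; it only illustrates what a 1d-avg downsample computes, not how OpenTSDB does it internally.

```python
from collections import defaultdict
from statistics import mean

SECONDS_PER_DAY = 86400


def downsample_daily_avg(points):
    """Group (timestamp, value) pairs by UTC day and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        day_start = timestamp - (timestamp % SECONDS_PER_DAY)
        buckets[day_start].append(value)
    return {day: mean(values) for day, values in sorted(buckets.items())}


points = [
    (1420070400, 20.0),  # 2015-01-01 00:00:00 UTC
    (1420113600, 22.0),  # 2015-01-01 12:00:00 UTC
    (1420156800, 18.0),  # 2015-01-02 00:00:00 UTC
]

print(downsample_daily_avg(points))
# {1420070400: 21.0, 1420156800: 18.0}
```

With millions of points, doing this on the client means transferring all of them first, which is exactly the work the downsampling parameter saves.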

Other awesome features

OpenTSDB has a few additional features to cover real-life situations. Of course we can easily add more with plugins. But 2 of them are worth mentioning: Annotations and CLI Tools.

Annotations
Annotations enable us to add additional meta data to data points. For example, we could store information when we opened and closed window in each room or when we changed the heating level.

Read more at http://opentsdb.net/docs/build/html/api_http/annotation.html.

CLI Tools

CLI Tools are simple tools for additional tasks like fixing the data storage (in case something breaks down), querying and deleting data, and creating metrics. One of the tools I use most often is scan, because it can delete data. It’s useful when we are running different tests.

To delete all temperatures for basement, we execute command

Again, we can filter what to delete with start and end parameters, metric and tags.

Wrap up

OpenTSDB has proved to be an excellent solution. It’s scalable, fast and has really neat features. Most importantly, it’s under active development and has many people contributing. With the era of IoT and Big Data upon us, it has a bright future ahead.

If you are ready, start with http://opentsdb.net/docs/build/html/index.html.