Python and multiple constructors

One thing I missed when switching from Java to Python was multiple constructors. Python does not support them (directly), but there are other approaches that work very similarly (maybe even better).

Problem

Let's say we are building a client to query a remote service (some aggregation service). We want to pass an aggregator along with the request.

To make the code more fluent and more robust for integrating into other solutions, we want multiple options for creating an aggregator.

The query.aggregator will create a new instance of Aggregator and pass it to the request.

(Possible) solution

Python has a great feature of accepting *args and **kwargs. We can create a single constructor that takes both.
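A minimal sketch of what this could look like (the Aggregator fields and defaults are illustrative, not from the original post):

```python
class Aggregator:
    def __init__(self, *args, **kwargs):
        # figure out what the caller actually passed in
        if len(args) == 1 and isinstance(args[0], dict):
            self.field = args[0].get('field')
            self.function = args[0].get('function', 'avg')
        elif len(args) == 2:
            self.field, self.function = args
        else:
            self.field = kwargs.get('field')
            self.function = kwargs.get('function', 'avg')
```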

Then, in the constructor, we check and parse args and kwargs. This solution works, but it has several problems:

  1. No indication of what is required and what is not
    This is most important for autocompletion. When I want to create a new instance of the class Aggregator, I want to know what is required. With the current constructor, this is really hard.
  2. Complexity and combinations
    There are many combinations of arguments that can initialize a new instance, and the constructor has to handle every one of them.

    The resulting parsing logic is weird and hard to read.

Better solution

Python has an option to decorate a method with @classmethod. We can define custom methods that work as multiple constructors. For example, we can create a method from_arguments.
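A sketch, continuing with the illustrative Aggregator from above:

```python
class Aggregator:
    def __init__(self, field, function='avg'):
        # the "real" constructor clearly states what is required
        if not isinstance(field, str):
            raise ValueError('field must be a string')
        self.field = field
        self.function = function

    @classmethod
    def from_arguments(cls, arguments):
        # arguments could be a dict, a list/tuple or a plain string -
        # parse it and delegate to the regular constructor
        if isinstance(arguments, dict):
            return cls(arguments['field'], arguments.get('function', 'avg'))
        if isinstance(arguments, (list, tuple)):
            return cls(*arguments)
        return cls(arguments)
```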

We use it as Aggregator.from_arguments(args). The validation of the parameters (e.g. whether a value is an int) is still done in the constructor.

The from_arguments method just parses the arguments and creates a new instance of Aggregator. We could also add validation (whether the list has at least 2 items, whether the str is in the correct format, whether the dict has all the required elements, …).

Django Rest Framework, NestedSerializer with relation and CRUD

I started a Django project that enables other services to interact with it over an API. Of course, one of the best solutions for building an API in Python is Django Rest Framework. It's a great project with a large community that got funded on Kickstarter. How cool is that?

Project

My project/service offers, among other things, access to and creation of companies and subscriptions. Each company can have multiple subscriptions – a one-to-many relation. I quickly created the models Company and Subscription.

One thing to notice here is that I use UUIDs. The reason lies in the fact that some other services also contain company data. Those services will create companies, as they have all the required data (id, name). With this I'm able to avoid sync problems.

For the Subscription model I will generate the UUID using a random method.
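A sketch of the two models (the Subscription fields other than the company relation are illustrative):

```python
import uuid

from django.db import models


class Company(models.Model):
    # the id is provided by the service that creates the company
    id = models.UUIDField(primary_key=True)
    name = models.CharField(max_length=255)


class Subscription(models.Model):
    # the id is generated locally with a random UUID
    id = models.UUIDField(primary_key=True, default=uuid.uuid4)
    company = models.ForeignKey(Company, related_name='subscriptions',
                                on_delete=models.CASCADE)
    valid_until = models.DateTimeField(null=True, blank=True)
```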

Django Rest Framework

Django Rest Framework has great docs. If you follow the quickstart, you can set up a working API in a few minutes.

In my case, I created serializers for both models.
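A sketch of serializers.py (the nested company field is the one discussed further below):

```python
from rest_framework import serializers

from .models import Company, Subscription


class CompanySerializer(serializers.ModelSerializer):
    # declared explicitly so the id is writable
    id = serializers.UUIDField()

    class Meta:
        model = Company
        fields = ('id', 'name')


class SubscriptionSerializer(serializers.ModelSerializer):
    company = CompanySerializer()

    class Meta:
        model = Subscription
        fields = ('id', 'company', 'valid_until')
```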

I had to explicitly define the id field on the company serializer. By default, ids are read-only (normally ids are generated at the database level), but in my case I pass the id while creating the company.

I also created viewsets.py.
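A sketch of viewsets.py:

```python
from rest_framework import viewsets

from .models import Company, Subscription
from .serializers import CompanySerializer, SubscriptionSerializer


class CompanyViewSet(viewsets.ModelViewSet):
    queryset = Company.objects.all()
    serializer_class = CompanySerializer


class SubscriptionViewSet(viewsets.ModelViewSet):
    queryset = Subscription.objects.all()
    serializer_class = SubscriptionSerializer
```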

For the last step, you have to add the viewsets to the API router.
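Something along these lines in urls.py:

```python
from rest_framework import routers

from .viewsets import CompanyViewSet, SubscriptionViewSet

router = routers.DefaultRouter()
router.register(r'companies', CompanyViewSet)
router.register(r'subscriptions', SubscriptionViewSet)

urlpatterns = router.urls
```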

Now when you access /api/companies/ or /api/subscriptions you should get a response (for now probably only an empty array).

This part is very simple and there are tons of examples how to do this.

Problem

To create a company, I execute a POST JSON request (I’m using Postman) to /api/companies/ with the following payload.
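For example (the values are illustrative):

```json
{
    "id": "a1b2c3d4-0000-0000-0000-000000000001",
    "name": "Acme"
}
```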

and the newly created company is returned in the response.

Now I have a company in the database. Let's create a subscription. Again, I execute a POST JSON request to /api/subscriptions with the payload
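Roughly like this (field names and values are illustrative; the company part contains only the id of the existing company):

```json
{
    "id": "a1b2c3d4-0000-0000-0000-000000000002",
    "valid_until": "2017-01-01T00:00:00Z",
    "company": {
        "id": "a1b2c3d4-0000-0000-0000-000000000001"
    }
}
```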

and I get an error that company name is required. What?

Same request and response

Before I go into explaining what the previous error means and how I solved it, I first have to explain what I want.

Other services that talk with my service use different HTTP clients. One of them is Netflix Feign. With it you can simply create HTTP clients that map the request or response to DTOs. For example, they have a SubscriptionDTO defined as
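roughly along these lines (field names are illustrative):

```java
import java.util.Date;
import java.util.UUID;

public class SubscriptionDTO {

    private UUID id;
    private Date validUntil;
    private CompanyDTO company;

    // getters and setters omitted
}
```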

and CompanyDTO
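which could look like:

```java
import java.util.UUID;

public class CompanyDTO {

    private UUID id;
    private String name;

    // getters and setters omitted
}
```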

So the same DTO is used for the request and the response. I want to pass the same DTO with all the required data when creating the subscription. When the response is returned, it populates the existing SubscriptionDTO. This is important, because I want to eliminate the confusion of using multiple DTOs for the same entity (Subscription).

Process of identifying the problem

To return to the previous error: when I retrieve subscriptions, I also want to include the company information in the subscription list.

I accomplished this by defining the company field in my SubscriptionSerializer.
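That is, the nested serializer declared without any arguments:

```python
company = CompanySerializer()
```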

If I didn't use this, then the response would only contain the company id.
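Something like this (values are illustrative):

```json
{
    "id": "a1b2c3d4-0000-0000-0000-000000000002",
    "valid_until": "2017-01-01T00:00:00Z",
    "company": "a1b2c3d4-0000-0000-0000-000000000001"
}
```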

But I don't want this, I want the full output. When I defined the company field, I didn't pass any arguments. By default this means that when I execute the POST, it will create the subscription and all its relations (the company). That is why I got the error that the company name is required: it wanted to create a new company, but the name was missing. But I don't want this.

I checked online and asked a few people. Most of them suggested that I pass the read_only=True argument when defining the company field: company = CompanySerializer(read_only=True). Now when I executed the POST, I got an error that the subscription.company_id database field should not be null. Once you define read_only for a field, its data is not passed to the method that creates the model (the subscription). Why?

There are many discussions around how to solve this.

a) https://groups.google.com/forum/#!topic/django-rest-framework/5twgbh427uQ
b) http://stackoverflow.com/questions/29950956/drf-simple-foreign-key-assignment-with-nested-serializers
c) http://stackoverflow.com/questions/22616973/django-rest-framework-use-different-serializers-in-the-same-modelviewset

Some suggest different serializers, others suggest using 2 fields (one for read and the other for create/update). But all of them seem hackish and impose a lot of extra code. The author of DRF, Tom Christie, suggested that I define the CompanySerializer fields (except id) as read-only. This kind of solved the problem, but if the company has additional fields, then I need to override them as well, which means extra code. At the same time, I want to preserve the /api/companies/ endpoint for creating/updating companies. If I set the fields as read-only, then I wouldn't be able to create companies without an additional CompanySerializer.

I tried to override the subscription create methods, but without success. If I defined read_only=True when creating the company field, then no company information was passed to validated_data (the data that is later used to create a subscription). If I defined read_only=False, then I was always getting the "name is required" error.

I wanted a simple and working solution.

Solution

I started to look for a solution that was simple and enabled me to make the requests that I want. Digging through the code, I noticed many methods for field creation that I could override. In the end, I had to modify a validation method.

I overrode validate_empty_values, where I check the relation. The idea is that I check the posted data. If an id (or primary key) of the related model is present, I validate that a record exists for that id and return it. If it doesn't exist or the data is invalid, I raise an error.

There is also an is_relation argument that you have to pass when creating the serializer. It is only used when the serializer is created as a nested serializer. The updated code is
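a sketch of the idea along these lines (the exact fields, error handling and messages are illustrative, not the author's original code):

```python
from rest_framework import serializers

from .models import Company, Subscription


class CompanySerializer(serializers.ModelSerializer):
    id = serializers.UUIDField()

    class Meta:
        model = Company
        fields = ('id', 'name')

    def __init__(self, *args, **kwargs):
        # is_relation is only passed when this serializer is nested
        self.is_relation = kwargs.pop('is_relation', False)
        super(CompanySerializer, self).__init__(*args, **kwargs)

    def validate_empty_values(self, data):
        if self.is_relation:
            try:
                # only the primary key matters - the company must already exist
                company = Company.objects.get(pk=data.get('id'))
                return (True, company)
            except (Company.DoesNotExist, AttributeError, TypeError, ValueError):
                raise serializers.ValidationError('Company with the given id does not exist.')
        return super(CompanySerializer, self).validate_empty_values(data)


class SubscriptionSerializer(serializers.ModelSerializer):
    company = CompanySerializer(is_relation=True)

    class Meta:
        model = Subscription
        fields = ('id', 'company', 'valid_until')
```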

With this in place, I can now execute a POST JSON request with the payload
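something like (values are illustrative):

```json
{
    "id": "a1b2c3d4-0000-0000-0000-000000000002",
    "valid_until": "2017-01-01T00:00:00Z",
    "company": {
        "id": "a1b2c3d4-0000-0000-0000-000000000001",
        "name": "Acme"
    }
}
```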

and the response comes back in the same shape, with the nested company populated.

The same DTO works for the request and the response. At the same time, I didn't modify the /api/companies/ endpoint: companies get created/updated normally, with all the required validation working as it should.

Passing collections between Akka actors

Akka actors are great when we are looking for scalable real-time transaction processing (yes, this is the actual definition, using some big words). Actually, they are really great for background processing because you can create many instances without worrying about concurrency and parallelism.

The code

We have a simple application for processing an uploaded file. We accept the file, parse it (a simple txt file), calculate the values and save them in some database. We could have everything in one actor, but it's much better to split it into multiple actors and create a pipeline. Each actor does exactly one thing. The code is much cleaner and, at the same time, testing is much easier.

We have (for this demonstration) 2 actors. One reads the file into a List and sends it to another actor.

The second actor gets the List of numbers and calculates their sum.
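A rough sketch of the two actors, using the Akka classic Java API (actor names, message types and the file format are illustrative, and each class lives in its own file):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import akka.actor.AbstractActor;
import akka.actor.ActorRef;

// Reads the uploaded file into a List and sends it on.
class FileReaderActor extends AbstractActor {

    private final ActorRef calculator;

    public FileReaderActor(ActorRef calculator) {
        this.calculator = calculator;
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(String.class, path -> {
                    List<Long> numbers = new ArrayList<>();
                    for (String line : Files.readAllLines(Paths.get(path))) {
                        numbers.add(Long.valueOf(line.trim()));
                    }
                    // the mutable list is sent directly to the next actor
                    calculator.tell(numbers, getSelf());
                })
                .build();
    }
}

// Receives the List of numbers and calculates the sum.
class CalculatorActor extends AbstractActor {

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(List.class, numbers -> {
                    long sum = 0;
                    for (Object number : numbers) {
                        sum += (Long) number;
                    }
                    // save the sum to the database here
                })
                .build();
    }
}
```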

If we used this code, we would quickly discover problems. When I tested it with VisualVM, I quickly discovered a memory leak with the List of numbers. How do we solve it?

Immutable collections

When passing objects between actors, we need to follow a few guidelines. If we break them, we can face memory leaks and, consequently, app crashes. One of the guidelines is to use immutable collections: if we pass collections between actors, they have to be immutable. What are the advantages of immutable objects?

  1. Thread-safe – so they can be used by many threads with no risk of race conditions.
  2. Doesn’t need to support mutation, and can make time and space savings with that assumption.
  3. All immutable collection implementations are more memory-efficient than their mutable siblings (analysis)
  4. Can be used as a constant, with the expectation that it will remain fixed.

There are many implementations of Immutable collections and one of the best ones is in Guava.

Improved code

We have to use ImmutableList to create a list of numbers for passing between actors.
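Sticking with the sketch above, the sending side would wrap the parsed numbers into an immutable copy before handing them to the next actor:

```java
import com.google.common.collect.ImmutableList;

// inside FileReaderActor, instead of sending the mutable ArrayList:
calculator.tell(ImmutableList.copyOf(numbers), getSelf());
```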

Rerunning VisualVM confirmed that the memory leak was resolved. Great.

Why I switched from OpenTSDB to KairosDB?

In my previous post, I described how to correctly install and use OpenTSDB. After some time, I decided to move on to another solution.

The story

Before everything else, we need to know one thing. Because of IoT, the demand for storing sensor data has increased dramatically. Many new projects have emerged; some are good, some are bad. They differ in the technologies used, how fast they are and what kind of features they support.

You can read the full list of IoT time series databases that can be used for storing the data of your Internet of Things project or startup.

Problems of OpenTSDB

OpenTSDB is great, don't get me wrong. But when you try to use it with more complex projects and customer demands, you can quickly hit a wall. That's mostly because it involves a lot of moving parts to make it work (Hadoop, HBase, ZooKeeper). If one of the parts fails, the whole thing fails. Sure, you can replicate each part and make it more robust, but you will also spend more money. When you are starting out, it's over-optimization and a waste of money (that you don't have).

Aggregation of the data is another problem. It does support basic functions like min, max, avg etc. I spent days investigating why the avg aggregation was not working correctly when I filtered by multiple tags. It just didn't want to work and I couldn't find anything in the docs. I asked on the Google group and after some time I got a reply that I must use another aggregation function, and that even that doesn't work 100% as I want it to. Another problem is when I want to get just one value – for example the avg of all values from X to now. Not possible!

The lack of clients to talk to OpenTSDB is another problem for me. Sure, storing data with the socket API is super simple and can be easily integrated in any language. The HTTP API is another story. Again, it shouldn't be a problem to implement my own client, but why waste time on this?

Development of OpenTSDB is slow and it takes ages for new features to be integrated. One of them (one of the most important for me) is the ability to support time zones. It's needed when downsampling data to one day (or more) so that data is correctly grouped. There was some work on it, but to this day it still hasn't been implemented. Too bad.

On the bright side, OpenTSDB is super fast. I was able to store and load data at a super fast rate – loading 3 million records in a few seconds is, for me, super fast. Try that with a relational database and you will quickly be disappointed.

KairosDB to the rescue

I remember that when I was doing my research, I noticed KairosDB but didn't spend much time testing it. It just wasn't appealing and I didn't know how it actually worked. Big mistake.

KairosDB uses Cassandra to store data (compared to HBase used by OpenTSDB) and is actually a rewritten and upgraded version of OpenTSDB. It has evolved into a great project. It has many more features: many more (and fully working) aggregation methods, the option to easily delete a metric or datapoint, easy extensibility with plugins etc. It has great clients and a much more active community. I remember asking a question on the OpenTSDB Google group and waiting weeks for an answer (I'm not forcing anyone to provide support, because after all, it's an open source project), while on the KairosDB Google group I got one within a day.

Why is this important, you might ask? Well, when you are chasing deadlines and something goes wrong, a responsive community is very important. Sometimes these kinds of things can be the difference between success and failure.

What now?

I wrote a tutorial on how to start with KairosDB. You can also visit kairosdb.org and check out the documentation. Feel free to play with it, test it and hopefully also use it in production.

pg_dump: permission denied for relation mytable – LOCK TABLE public.mytable IN ACCESS SHARE MODE

One of the good practices is to create backups of your database at regular intervals. If you are using a PostgreSQL database, you can use the built-in tool called pg_dump. With pg_dump we can export the database structure and data. In case we want to dump all databases, we can use pg_dumpall.

When I was creating a simple bash script, I was getting a very strange error: pg_dump: permission denied for relation mytable – LOCK TABLE public.mytable IN ACCESS SHARE MODE. Googling around, I got a few tips on how to solve the problem, but no actual solution.

Script to dump

To make our life easier, we use a script for the whole process. It's also convenient to have a script that we can later call from other processes, from build tools (a backup before upgrading) or with cron.
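A sketch of such a script (host, user, database and path names are illustrative):

```bash
#!/bin/bash
# dbbackup.sh - dump a single database into a dated file

BACKUP_DIR=/var/backups/postgres
DATE=$(date +%Y%m%d)

pg_dump -h localhost -U exportuser mydatabase > "$BACKUP_DIR/mydatabase_$DATE.sql"
```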

When we run this, we get the previous error. Big problem.

Locked table problem

The problem is with permissions. There are multiple permission layers. The first is whether we actually have access to the database. The second layer is whether we have access to the table; in our case the table mytable. To check it, we need to see the structure and permissions of the table.
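Inside psql, the quickest way is to connect to the database and list its tables:

```
$ psql mydatabase
mydatabase=# \dt
```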

The above commands will output all the tables in the database. If we check the columns, we will notice that there is an Owner column. In our case it's important that the table owner and the export user are the same, otherwise we get the permission problem.

To change the ownership of the table, we need to run the command
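for example (table and user names are illustrative):

```sql
ALTER TABLE public.mytable OWNER TO exportuser;
```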

The command will alter the table ownership to our export user. Be sure to change the owner for every table in the selected database.

Extra tip – cron

Of course, we don't have time, and we especially don't want to waste it on tasks that can be automated. One of them is running our dbbackup.sh every week, month or at whatever interval you desire. To perform backups every week, we can use cron.

To add a cron job, we need to edit the crontab.
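For the current user, that is simply:

```bash
crontab -e
```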

It will show a simple editor where you write your tasks/jobs. In our case, we will run the backup every Sunday in the morning (at 00:00).
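The crontab entry would look something like this (the script path is illustrative):

```
# m h dom mon dow  command
0 0 * * 0 /home/backup/dbbackup.sh
```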

To make our script work with cron, we need to add one extra thing. If we run the script, we are asked for the password. Cron cannot enter the password, so it will fail. Based on suggestions, we should create a ~/.pgpass file and add a line to it.
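The line follows the standard .pgpass format (values are illustrative; the file must be readable only by its owner, e.g. chmod 600):

```
# hostname:port:database:username:password
localhost:5432:mydatabase:exportuser:secretpassword
```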

Now when cron runs the script, everything will work.

Scripts to start and stop Play Framework application

Play Framework (I'm talking about the 2.x version) has multiple ways of being deployed. If you check the docs, you will notice it has a few pages of instructions just on how to deploy your Play Framework application. For me, using stage seems to be the best and most stable way.

Stage

When deploying an application, I always run the clean and stage commands. The first one removes compiled and cached files. The second one compiles the application and creates an executable. Everything will be located in ${application-home}/target/universal/stage/. There you have a bin folder with a simple script to run the app inside. We will create start/stop scripts to make our work a little bit easier.

Scripts start/stop

To run the application, we use nohup. nohup ensures that the application keeps running even when we close the terminal or log out from our development machine. As everyone says, it's not a perfect solution, but it works. We run the command in the stage folder and add additional parameters.
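A sketch of start.sh (the application name, paths and memory settings are illustrative):

```bash
#!/bin/bash
# start.sh - run the staged application in the background

APP_HOME=/path/to/application-home
cd $APP_HOME/target/universal/stage

nohup bin/myapp -J-server -J-Xms512M -J-Xmx1024M \
    -Dconfig.resource=application-prod.conf \
    -Dhttp.port=9000 > /dev/null 2>&1 &

# keep our own copy of the pid next to the application
echo $! > $APP_HOME/RUNNING_PID
```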

  1. The -J prefix (e.g. -J-server) enables us to pass additional JVM-related settings. In our case, we define Xms and Xmx for better memory management.
  2. We have application.conf for development and application-prod.conf for production, with different settings (database, secret key, API logins, etc).
  3. With -Dhttp.port we define the port of our application. We use Apache to map port 80 to 9000. It's much safer and easier like this, because later we can put a load balancer in the middle to divide the load between multiple application instances.

When running nohup, it will create a nohup.out file in which it logs everything (basically whatever the application prints). Don't confuse it with the application logs. The application will still log everything based on the logger.xml configuration, independently of nohup. To prevent the nohup.out file, we redirect everything to /dev/null and basically just ignore it.

At the end, we output the pid into a RUNNING_PID file. Be careful: Play Framework automatically creates an additional RUNNING_PID file in the stage folder. We add ours as extra information, and the file is removed after stopping the application.

When we want to stop our application, we need to get the pid. We read it from the RUNNING_PID file in the stage folder and pass it to the kill command. For safety reasons, we wait 5 seconds just to be sure that the application has stopped; we could have a running job which needs a few more seconds to complete or save its state.
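A matching stop.sh sketch:

```bash
#!/bin/bash
# stop.sh - stop the application using the pid Play wrote to the stage folder

APP_HOME=/path/to/application-home

kill $(cat $APP_HOME/target/universal/stage/RUNNING_PID)

# give a running job a few seconds to complete or save its state
sleep 5
```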

Extra tip

We can also pass additional parameters to our start command. One of them is javaagent. If we are using some remote monitoring solution like New Relic, we can include its jar to send application data.

To do so, we need to pass -J-javaagent:/path/to/newrelic.jar along with all the other parameters. Be sure that you include the correct path, because otherwise the application will fail to start.

Handle file uploads in Play Framework 2.x [Java]

Most applications have the ability to upload something. Handling uploaded files should not be hard. We need to check whether the user uploaded a file, check that it's the right type and store it. As a matter of fact, this is really easy with Play Framework.

An example

We have a form to upload a file.
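A minimal form might look like this:

```html
<form action="/upload" method="post" enctype="multipart/form-data">
    <input type="file" name="file">
    <button type="submit">Upload</button>
</form>
```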

This form will take one file and post it to the /upload path. To be able to upload a file, we need to set the enctype to multipart/form-data. This defines how the POST will be constructed and how the file will be sent.

The next thing is to create a controller and a method. We will only allow uploading of PDF files.
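A sketch of the controller, using the Play 2.x Java API (pre-2.6 style with static helpers; the controller name is illustrative):

```java
import java.io.File;

import play.mvc.Controller;
import play.mvc.Http;
import play.mvc.Result;

public class UploadController extends Controller {

    public static Result upload() {
        Http.MultipartFormData body = request().body().asMultipartFormData();
        if (body == null) {
            return badRequest("Expecting multipart/form-data");
        }

        Http.MultipartFormData.FilePart file = body.getFile("file");
        if (file == null) {
            return badRequest("Missing file");
        }

        if (!"application/pdf".equals(file.getContentType())) {
            return badRequest("Only PDF files are allowed");
        }

        File pdf = file.getFile();
        // move or copy the file to its final location here
        return ok("File uploaded");
    }
}
```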

Very simple, right? I highly recommend you move this code somewhere else (for example into some service). Good practice is to keep controllers slim.

First we need to check the type of the request and verify that it's multipart/form-data. If the body is null, then something is wrong. The same goes for the file: if no file is present, we need to report an error. Beware, it's easy to spoof the content type. Checking whether the file is really a PDF can sometimes be more difficult. The best way is to use an additional library – one of them is Apache Tika.

Handling multiple files

We can also handle multiple files at once. All we have to do is loop through all the posted files.
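Roughly, inside the same controller method as above:

```java
List<Http.MultipartFormData.FilePart> files = body.getFiles();
for (Http.MultipartFormData.FilePart part : files) {
    File uploaded = part.getFile();
    // validate and store each file
}
```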

Extra tip

When we upload a file, it will be stored in the /tmp folder (if we use a Linux server, of course). Then we just need to move or copy the file to the right folder.

The recommended way is to use the highly tested library Apache Commons IO and the methods FileUtils.copyFile(source, destination) or FileUtils.moveFile(source, destination).

Run a program as certain user from service on Windows XP

At work we are collecting data about different production lines. We use Windows XP machines (yes, I know). On them we have a server program which connects to the sensors. We access the server with our Python script and use the data for further processing.

Pretty simple, right? Well, it gets a little more complicated. We have a service (Quartz.NET) that calls our Python script every 5 seconds. Everything works great, but there is one little problem.

The problem

The server is very unstable and crashes a lot. There is no log to check what the problem is. We have no idea how to make it more robust. But the good thing is that when the server crashes, all you have to do is log in to the machine and restart it (double-click the icon to start the program again).

So the idea was just to automatically start the server program when we detect a crash. But it's not so simple.

Solution #1

First we need to know one thing. The Quartz.NET service is running as SYSTEM. It calls the Python script as the SYSTEM user. That means that everything we call from the script will also run under the SYSTEM user.

Our first solution was to simply call the server program from our Python script.
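Something along these lines (the path is illustrative):

```python
import subprocess

# start the server program - it runs under the same user as the caller (SYSTEM)
subprocess.Popen(r"C:\server\server.exe")
```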

As mentioned before, this will run server.exe as the SYSTEM user. Not a good thing, because we quickly got problems with too many instances and memory leaks. If the user manually started the server, there were 2 instances running under different users.

Solution #2

We cannot run the program as the SYSTEM user; it has to run as the logged-in user – Mike. The idea was to use the Python pywin32 extensions to control the system. There is some nice code floating around where you pass the user credentials along with the program path, and it runs the process as another user.
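The usual pywin32 snippet looks roughly like this (credentials and path are illustrative; this is the kind of code that did not work in our environment):

```python
import win32con
import win32process
import win32security

# log in as the target user and start the process under that token
token = win32security.LogonUser(
    "Mike", None, "password",
    win32con.LOGON32_LOGON_INTERACTIVE,
    win32con.LOGON32_PROVIDER_DEFAULT)

startup = win32process.STARTUPINFO()
win32process.CreateProcessAsUser(
    token, r"C:\server\server.exe", None,
    None, None, False,
    win32con.NORMAL_PRIORITY_CLASS,
    None, None, startup)
```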

The code did not work for me. I tried different variations, but no success. Someone also mentioned that I needed to change some permissions. Even though I changed them, it still did not work. Even if it did, it would be really tricky to make these kinds of changes if they require admin permissions.

Solution #3

Let's forget about running the program from the Python script and maybe use some other way. Of course, here comes Windows Batch. There is a really nice command called runas. The command takes the login credentials and the path to the executable program.
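For example (user and path are illustrative):

```bat
runas /user:Mike "C:\server\server.exe"
```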

This works only if your user doesn't have a password. But our Mike user has a password, so we couldn't use it.

The problem is that when you start the batch script which calls runas, it will prompt you for a password. There is a parameter, /savecred: basically, you enter the password only the first time and it memorizes it. But in our case, the service calls the Python script and the Python script calls the batch file which calls runas. When runas prompts for a password, the service cannot enter it. So nothing happens and this also does not work.

Solution #4

Reading online, I found a really nice program called PsExec. It allows you to log in to the computer (mostly used for accessing remote computers, but it also works locally) as a certain user and execute a program.

First I tried to run it without the -l parameter and nothing happened. But when I added it, the program still ran as the SYSTEM user instead of user Mike. The -l parameter actually means "Run process as limited user", which explains the problem. Again, it did not work for us.

Solution #5 – The working one

The working solution is really interesting and uses a built-in feature of Windows: the Task Scheduler.

Windows has a built-in command called schtasks. With this command we can schedule a new task, define its frequency, manually start it and, at the end, even delete it.

Creating a scheduled task has a few problems. The first is that scheduled tasks are executed on whole minutes (even though we can define the start time /st with seconds). In our case, we need the task to run a few seconds after we create it. But if we define it to run at 16:00:05, it will actually never run, because all times from 16:00:00 to 16:00:59 actually run at 16:00:00. We could add a minute to the time, for example 16:01:00, but in the worst-case scenario we would need to wait almost a minute for the scheduled task to run. At the same time, adding minutes to a time in a bat script is not really easy.

Our solution schedules the task to run at the end of the day (the time doesn't actually matter), and then we run it manually. There is a ping trick, which is similar to sleep: we wait 3 seconds for the task to finish and then delete it.
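Putting it together, the batch script looks roughly like this (task name, path and credentials are illustrative):

```bat
schtasks /create /tn "StartServer" /tr "C:\server\server.exe" /sc ONCE /st 23:59:00 /ru Mike /rp password
schtasks /run /tn "StartServer"
ping 127.0.0.1 -n 4 > nul
schtasks /delete /tn "StartServer" /f
```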

In short, our code creates a task to run under user Mike, manually runs the task, waits for it to finish and then deletes the task. The end result is that our program finally starts under user Mike.

But beware of one really crazy thing. The /sc parameter defines the frequency of the task: we can define it to run the task once, every day, every week and so on. But the parameter value is language dependent. So in English it is ONCE, in Slovenian ENKRAT, in German EINMAL and so on. Strange, right?

Why I think Spring Data repositories are awesome – Part 2

In the first part, we covered some very basic things we can do with Spring Data repositories. In this part, we will learn how to make more complex queries. By that I mean how to find data by entity field or make a count. You will be amazed how easy it is with Spring Data.

Entities

We will use the Post entity from our previous part, update it and add an entity called User.
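A sketch of the two entities (each in its own file; only the fields used later – url, isActive, user and username – are shown, and getters and setters are omitted):

```java
import java.util.Set;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

@Entity
public class User {

    @Id
    @GeneratedValue
    private Long id;

    private String username;

    @OneToMany(mappedBy = "user")
    private Set<Post> posts;

    // getters and setters omitted
}

@Entity
public class Post {

    @Id
    @GeneratedValue
    private Long id;

    private String url;

    private boolean isActive;

    @ManyToOne
    private User user;

    // getters and setters omitted
}
```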

As you can see, we have defined a relation between our entities. Every user can have multiple posts and each post has exactly 1 user.

Tasks

Let’s imagine we have a situation where we want to create a simple blogging system. We have to be able to:

1. find all active posts
2. find a post by an url
3. find all posts by a user
4. count all active posts

1. Find all active posts

We will use the PostRepository we defined in the first part. By simply adding a method to the repository, Spring Data will generate the right code and map everything to SQL. Because CrudRepository already has a few basic methods prebuilt, we don't need to add a method to find all posts. Instead, we can use the findAll method.

But to find all active posts, we have to define our own custom method. Actually, it's very simple.
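A sketch of the repository with the new method (the return type could just as well be a List):

```java
import java.util.Set;

import org.springframework.data.repository.CrudRepository;

public interface PostRepository extends CrudRepository<Post, Long> {

    Set<Post> findByIsActiveTrue();
}
```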

This is it. This is the whole magic. One short line of code. But how does it actually work? Spring Data builds the query based on the method name and the method return type. It splits the method name – in our case into find, By, IsActive, True. The first part says to make a select query, the second indicates we want to filter, the third is the field name and the fourth is the value of the field. But be careful: defining the field value right after the field name only works for booleans. For other field types, you need to pass the value as a method argument. One great thing is also that we can combine multiple fields.

2. Find post by an url

Continuing the thought from the previous section, we can build methods from different fields. For example, let's load a post by its url. We have to update our repository.
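The method added to the PostRepository interface would look like this:

```java
Post findByUrl(String url);
```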

Again, if we look at the method name, we will see that we are finding a post by its url. Because the url is a String, we have to pass the value as a method argument. Because the return type is a Post, it will return one post. If the query returns multiple rows, an exception will be thrown. Be careful, when you expect only 1 record, to query by some unique field.

But actually, our blog system has to return a post that matches the url and is active. We could have code where we load the post by url and then check whether it is active or not. Instead, we can do this in one query.
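Again added to the PostRepository interface:

```java
Post findByUrlAndIsActiveTrue(String url);
```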

We are now querying the database by 2 fields: url and isActive. When we use different fields in a method name, all of them are joined by AND. We cannot use OR here; for that, we have to use some other approach (we will explain it in another tutorial).

3. Find all posts by a user

Every user has a username. Our task is to find all posts by a user or, more specifically, to find all posts by a username. Writing the method name is much the same; we just need to include the relation name. Again, we update our PostRepository.
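The method added to the PostRepository interface:

```java
Set<Post> findByUserUsername(String username);
```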

The method name has to include the relation name. Whatever name we defined in the entity, we have to use the same name in the method. If we change it in the entity, we also have to change the method name. It may look complicated and not really robust to changes, but there is no other way for Spring Data to know how to correctly build the query. Of course, as we have mentioned a few times already, in the next part we will learn how to use custom queries to help Spring Data with building the native SQL query.

Once we define the relation in the method name, everything else is the same. We again filter by field name, and we can also use multiple fields. But remember, for each relation field, we have to prepend the field name with the relation name.

4. Count all active posts

For the last task, we have to count all active posts. CrudRepository already has a method called count(), but it will count all posts. We could use the findByIsActiveTrue() method to find all active posts and get a populated Set. All we have to do then is call .size() and there we have the count of all active posts.

Don't do that. Sure, it works, and it might even work in production for a small number of posts, but with a larger dataset it's not good practice. We have to fetch all the records, populate the Set and then call .size() just to get one number. The overhead is too big.

Instead, we will use count, which maps to the SQL count. It's much, much faster and consumes far fewer resources. Before, we were finding records, so we prefixed every method name with find. If we want to count, what do we do? You're right, we prefix the method name with count. Let's update our PostRepository one last time.
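The method added to the PostRepository interface:

```java
Long countByIsActiveTrue();
```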

There are a few differences compared to the other method names. The first is the return type. It has to be a Long, so the row count can be bound to it (an Integer can be too small). As we mentioned before, we start the method name with count and then define the field filters. It's that simple.

Further reading

You can read all about how to correctly build method names in the official docs. Everything is explained really nicely and some additional keywords are demonstrated as well. I strongly recommend it.

Part 3 – What more will we learn?

In the next part, we will learn how to make even more complex queries by using the @Query annotation. The @Query annotation enables us to write HQL, which is very similar to SQL but has compile-time checking. Another thing we will learn is how to extend a Repository and use the PersistenceManager to build super complex queries. We will create custom methods and insert them into repositories. It's a really cool and advanced feature, so stay tuned.

ng-repeat with draggable or how to correctly use AngularJS with jQuery UI

AngularJS is an amazing framework. Together with jQuery and jQuery UI, it is a killer combo. But sometimes it's really difficult to make them work together.

Task

Imagine we have a box (a div) and, inside it, some elements that we can drag around.
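The markup might look roughly like this (the items-drag attribute is the directive we define below):

```html
<div id="items" items-drag>
    <span>Item 1</span>
    <span>Item 2</span>
    <span>Item 3</span>
</div>
```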

Solution #1

We will attach the jQuery UI draggable inside the directive that we added to the HTML (items-drag)

and create an app with a directive.
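A sketch of the app and directive (this assumes jQuery and jQuery UI are loaded before AngularJS, so element is a full jQuery object):

```javascript
var app = angular.module('app', []);

app.directive('itemsDrag', function () {
    return {
        restrict: 'A',
        link: function (scope, element) {
            // find all spans inside the box and make them draggable
            element.find('span').draggable();
        }
    };
});
```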

This works, but it’s totally unrealistic. In real life we probably load items from somewhere and populate the div. So let’s try that.

Solution #2

We add the controller ItemsController to the HTML with ng-repeat
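The markup now renders the items with ng-repeat (controller and item names are illustrative):

```html
<div id="items" items-drag ng-controller="ItemsController">
    <span ng-repeat="item in items">{{ item }}</span>
</div>
```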

and add the controller to our app.
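A controller sketch that simulates loading the items from a server with $timeout:

```javascript
app.controller('ItemsController', function ($scope, $timeout) {
    $scope.items = [];

    // simulate an asynchronous load from the server
    $timeout(function () {
        $scope.items = ['Item 1', 'Item 2', 'Item 3'];
    }, 1000);
});
```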

This will NOT work. When the directive is loaded, it will find all spans and attach draggable to them. But because items is empty, it won't find any spans. When the items are loaded from the server (the $timeout executes), ng-repeat will render and show them, but draggable will not be attached.

We can solve this by adding a $watch, watching for items to update, and attaching draggable then. Let's just update our directive.
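A sketch of the directive with a $watch on items:

```javascript
app.directive('itemsDrag', function () {
    return {
        restrict: 'A',
        link: function (scope, element) {
            // re-attach draggable whenever the items change
            scope.$watch('items', function () {
                element.find('span').draggable();
            });
        }
    };
});
```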

This works. Great. But actually there is a big problem. When ng-repeat is adding the elements to the DOM, the $watch callback is fired and draggable is attached to the items. The problem is that this happens while ng-repeat is still running, so draggable is not attached to all elements. What now?

Solution #3

We need to somehow wait for ng-repeat to finish so that all elements/items are loaded into the DOM. Based on my research, there is no bulletproof way. Some suggest using a timeout.
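For example, delaying the attachment with $timeout inside the $watch (the delay value is arbitrary, which is exactly the problem):

```javascript
app.directive('itemsDrag', function ($timeout) {
    return {
        restrict: 'A',
        link: function (scope, element) {
            scope.$watch('items', function () {
                // wait and hope that ng-repeat has finished by then
                $timeout(function () {
                    element.find('span').draggable();
                }, 500);
            });
        }
    };
});
```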

This solution has one big problem: we can never set the right timeout. If we set it too low, it won't work when we have a long list of items. If we set it too high, we impact the user experience.

Solution #4 – The working one

The working solution is actually really simple and works for small or large lists of items without impacting the user experience.

We updated the element the directive is attached to. We don't attach the directive to div#items anymore, but to each span. When each element/item is added to the DOM, the directive is fired and attaches draggable. So there are no timeouts and no watching whether $scope.items changed.
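A sketch of the final markup and directive (the directive now sits on each repeated span):

```html
<div id="items" ng-controller="ItemsController">
    <span ng-repeat="item in items" items-drag>{{ item }}</span>
</div>
```

```javascript
app.directive('itemsDrag', function () {
    return {
        restrict: 'A',
        link: function (scope, element) {
            // each repeated span makes itself draggable when it is linked
            element.draggable();
        }
    };
});
```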

For me, this is the best way to combine AngularJS with jQuery UI Draggable. Of course, it also works for any other plugin.