Pragmatic Programmer Issues

kdb+ – a high-performance, column-oriented database designed for massive datasets


Data volumes are growing continuously, and we need to do complex processing on them in real time. kdb+ tries to solve that problem. It includes q, a vector-processing programming language with SQL-like queries.

Take for example the New York Stock Exchange, which today produces a billion records per day (at peaks over two billion), amounting to over 200 GB of data per day. kdb+ not only handles that volume, it also lets you analyze the data in real time.

I think it is worth checking out, so this post is my kdb+ evaluation. First of all, kdb+ is 64-bit, but unfortunately you can only evaluate the 32-bit version – for me that is enough, but for a large company it may be an issue. A lot of the Is?, What?, Does? questions are answered in the FAQ.

With kdb+ you can use:

  • kdb+taq – to analyze the Trade and Quote (TAQ) database (unfortunately it isn’t free)
  • kdb+tick – to capture and analyze billions of real-time events (ticks); we can write new feed handlers (Java, C) or use built-in ones (TIBCO, Reuters, Bloomberg). One diagram is worth a thousand words (PDF)

Getting started with kdb+ is instant.

  1. Download the 3.0 or 2.8 evaluation.
  2. Unpack it.
  3. Run the q command, and you get a REPL:
KDB+ 3.0 2012.12.20 Copyright (C) 1993-2012 Kx Systems
m32/ 8()core 8192MB sebastian.pietrowski c46024.local 192.168.1.15 PLAY 2013.03.20
q)
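
For a quick sanity check, a couple of throwaway expressions at the prompt (til and sum are built-in q verbs):

q)2+2
4
q)til 5              / the first five integers
0 1 2 3 4
q)sum til 5          / verbs work on whole lists at once
10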

Next I went to the Tutorials page on the wiki. There are three tutorials right now:

  • starting kdb+: The first few sentences seemed meant to discourage me: “for serious evaluation of the product you need the help of a kdb+ consultant“ – but I’m not a serious man, so I can manage without a consultant :). A little later we get a nice tip that the q language has some visual tools. I’ve picked Studio for kdb+; it is cross-platform and there is a special build for MacOS users. There is also an Eclipse plugin – I’m not an Eclipse fan, but for those who are it should be perfect. For anyone who has never seen the q language and kdb+ it is a nice place to start. We can learn:
    • how to switch between k-mode and q-mode by using \.
    • how to execute shell commands (\command_name, or system "command_name").
    • about types, data structures, namespaces, error messages, etc.
    • that joins in q queries are much nicer – I would love to see this in the SQL standard.
    • a TCP/IP listener can be started with “q -p 5001” or “\p 5001“. The communication channel can be used synchronously or asynchronously. We can access and see all variables through a web browser at http://localhost:5001 (if we chose 5001).
    • tables are created out of lists and live in memory; if needed they are written to disk, and for big ones kdb+ will partition them (more information in the 6th chapter of the third tutorial).
    • the ? verb generates random data – it’s very useful (see the sketch after this list).
    • the historical database (hdb) holds data from before today – it typically lives on disk.
    • the real-time database (rdb) stores today’s data in memory (it is written to the hdb the next day) – it is recommended to have 4 times as much RAM as the expected data size.
    • data feeds are time-series data; a feed handler converts the data stream into kdb+ records.
  • Q for Mortals: this tutorial is about the q language itself, so you can learn about its origins, author, philosophy, etc.
    • It is definitely a good idea to bookmark it. It is worth learning to distinguish between 0w, 0W, 0n, 0Wh, etc. (positive float infinity, positive long infinity, float null/NaN, positive short infinity, and so on).
    • I think more languages should introduce rules like “… so specifying more than eight arguments will cause an error” ;). Recommendations such as this one are awesome too: “When a function exceeds 20 expressions, you should ask yourself if it can be factored.”
    • partial application is great – here we have projection and even multiple projection.
    • adverbs are pretty cool, aren’t they?
    • q-sql melts my brain – but for now that is OK, I’m not a q consultant 🙂 (there is a small q-sql sketch after this list).
    • there are a lot of license error messages – try to guess what this one means: ‘wha‘.
    • the References should be bookmarked, as you will need them almost all the time.
  • and kdb+ for mortals: the last tutorial describes internals; here we can find really interesting stuff – whether we end up using kdb+/q or not, it is worth reading.
    • tables with many columns should be splayed – most queries refer to a subset of columns (only those columns will be loaded).
    • Symbols vs. strings – the golden rule is “fields that are invariant (e.g., keys), are drawn from a controlled domain, or will primarily need equality testing are excellent candidates for symbols – e.g. exchange symbols” and “Fields that need selection on content or are rarely referenced are good candidates for string fields—e.g., comments, audit notes“.
    • Symbols vs Strings (grey area) – “fields such as an alphanumeric order ID. Such fields are unique and invariant but the domain is not limited, as there will be a significant number of new variants each day. Further, the values may be recycled over time. The best advice is to consider your use cases carefully, paying attention to the likely long-term disposition of the ID values“.
    • for partitioned tables, kdb+ will use a map-reduce algorithm for aggregate functions.
    •  I/O bandwidth limits performance on large partitioned tables. It is possible to spread a table across multiple I/O channels to facilitate parallel retrieval.
    • Segmentation spreads a partitioned table’s records across multiple storage locations.
    • Two tables that share the same partitioning are not required to share the same segmentation decomposition.
    • In the end, if you get lost as I did, the tutorial comes with help 🙂
      whereami:{-1 "cd ~ ",system "cd";}
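
A minimal sketch tying a few of those points together – the ? verb generating random data, a table built in memory from lists, a q-sql aggregation and a left join. The table and column names below are made up for illustration:

n:1000
trade:([] sym:n?`AAPL`IBM`MSFT; price:n?100.0; size:n?500)   / ? draws n random symbols, floats and ints
select avg price, sum size by sym from trade                 / q-sql: aggregate grouped by sym
info:([sym:`AAPL`IBM`MSFT] sector:3#`tech)                   / a keyed table to join against
trade lj info                                                / left join on the key column sym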

My first idea for checking out kdb+ was to work on large Excel files (exported to CSV). Instead of working in Excel I can use q-sql.

cols: ("ISI";",") 0: `:data.csv        / read the CSV as typed columns (I=int, S=symbol, I=int)
tab: flip `col_a`col_b`col_c!cols      / build an in-memory table from the column lists
...                                    / do magic with q-sql
save `:tab.csv                         / save to file - save is smart, so: no extension - binary file;
                                       / csv - ','; txt - tab;
                                       / xls - excel file; xml - xml file;
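
The “magic” step depends on the data; as a hedged example (using the placeholder columns from the snippet above, where col_b is the symbol column), typical q-sql might look like:

select total:sum col_c by col_b from tab    / sum col_c grouped by col_b, like GROUP BY in SQL
select from tab where col_a>100             / plain filter, like a WHERE clause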

It is simple and easy, and from my point of view SQL beats VB, but I will not pay for kdb+ just to improve Excel :).

The real use case is one like NYSE: a flood of data is coming in, we want to process it, and we also want to analyze it in real time. Most solutions are able to process data, but here the data is divided into two subsets:

  1. A live window of data (minutes, hours, days) – to be processed as fast as possible
  2. Historical data – for the analytical part

It would also be useful in all the Hadoop scenarios (e.g. log processing) – but in real time.

I checked how it performs on my Mac (2.2 GHz Core i7, 8 GB RAM) and got these results:

  • 1.126 million inserts per second (single insert)
  • 22.727 million inserts per second (bulk insert 10)
  • 125 million inserts per second (bulk insert 100)
  • 200 million inserts per second (bulk insert 1000)
  • 200 million inserts per second (bulk insert 10000)

Of course it is just a micro-benchmark.
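
For reference, this is roughly the kind of micro-benchmark I ran – the trade schema and values below are illustrative, not the exact script (\t:n repeats an expression n times and reports the elapsed milliseconds):

trade:([] time:`timestamp$(); sym:`symbol$(); price:`float$(); size:`long$())   / empty typed table
row:(.z.p;`AAPL;100.0;50)                                                        / one row of values
\t:1000000 `trade insert row                       / one million single-row inserts
bulk:flip `time`sym`price`size!(100#.z.p;100#`AAPL;100#100.0;100#50)             / a 100-row chunk
\t:10000 `trade insert bulk                        / one million rows in bulk inserts of 100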

Integration is also a strong point of kdb+, as it has interfaces for major programming languages such as C/C++, Java, .NET, R, MATLAB, Perl and Python.

Conclusions:

If you need a single solution for real-time data with analytics, then you should include kdb+ in your evaluation. kdb+ stores the database as ordinary native files, so it has no special needs regarding hardware or storage architecture. It is worth pointing out that because the database is just files, your administrative work won’t be difficult (standard tools apply).

The architecture of kdb+ was built to be extremely fast: 64-bit to handle enormous amounts of data, a columnar structure to simplify indexing and joining (and reduce the memory footprint), and date/time atoms as basic data types that make time-ordered analysis fast and efficient. Last but not least, multi-core and multi-threaded processing brings a huge benefit here.

And the result is that we can process millions of records per second.

Happy evaluation!


And one more thing: if you are searching for existing solutions, it is worth checking out the svn contrib repository and/or the Contrib wiki page as well.


Comments

JohnD

Nice concise article Pedrowaty.

The RAM required = 4*(largest dataset)
is due to:
1 set of data to have it in memory
1 set of data that we are working on and modifying
1 set of serialization data, as the message is queued to be sent to a client
1 for luck 🙂

If you’re looking for a GUI I’d like to recommend qStudio
http://www.timestored.com/qstudio/

– John

pedro

Good catch – not the whole data set, but the working data set (more or less one day of data), and of course it sums up across nodes.

pedro

DNA – you can go and analyze it yourself 😛

pedro

Yes, that’s true (for over a year now). I hope things will change 🙂