Category Archives: BigData

NoSQL hype – Cassandra example

In the last few weeks there has been a lot of NoSQL hype, with more and more information about companies migrating from relational databases such as MySQL to NoSQL solutions. There are a lot of pretty awesome NoSQL solutions on the market, but from my point of view the most promising is Cassandra.

Cassandra was originally developed at Facebook (by the way, the core developer was hired from Amazon – he was one of the authors of Amazon Dynamo). In 2008 Facebook open sourced the Cassandra project and now it is developed by Apache.

The Apache Cassandra project is based on two awesome papers, “Bigtable: A Distributed Storage System for Structured Data” (2006) and “Dynamo: Amazon’s Highly Available Key-Value Store” (2007). The result is:

  • Fault tolerant – data is replicated and nodes can be replaced with no downtime.
  • Scalable – read and write throughput increase linearly as new nodes are added.
  • Proven – Digg, Facebook, Twitter and more are great examples of usage.
  • Easy to use – high-level APIs for Java, Ruby, Python, Scala and more.

Most of the API is exposed through Thrift, which is also developed by the Apache Foundation but is still in the incubator. It is a framework for cross-language services development.

As Cassandra is NoSQL storage it does not have typical relational tables; instead it uses data structures such as:

  • Column – a tuple of name, value and timestamp.
  • SuperColumn – a name and a map of Columns.
  • ColumnFamily – the infinite container for Columns.
  • SuperColumnFamily – the infinite container for SuperColumns.
  • Keyspace – the outermost grouping of your data.
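A minimal Python sketch of how these structures nest; the keyspace, column family and row names below are invented purely for illustration:

```python
import time

# A Column is a (name, value, timestamp) tuple.
def column(name, value):
    return {"name": name, "value": value, "timestamp": int(time.time() * 1e6)}

# A ColumnFamily maps row keys to maps of Columns;
# a Keyspace is the outermost grouping of ColumnFamilies.
keyspace = {
    "Twissandra": {                      # Keyspace
        "Users": {                       # ColumnFamily
            "jsmith": {                  # row key -> map of Columns
                "name": column("name", "John Smith"),
                "city": column("city", "NYC"),
            }
        }
    }
}

# Reads are key lookups all the way down.
print(keyspace["Twissandra"]["Users"]["jsmith"]["name"]["value"])  # John Smith
```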

Very nice introduction article for Cassandra Data Model.

The main problem with NoSQL databases is that data modeling is completely different from relational data modeling. Because we ask for structured data by a given key, to achieve the best performance we should store the data accordingly.

First of all, NoSQL databases are not always better, so as always use the right tool to get the job done without pain. We should decide whether we need a NoSQL database such as Cassandra. So… why might we want to use a NoSQL solution:

  1. No single point of failure – the relational model is hard and expensive to cluster (sequences, cascades, transactions, etc.). Oracle or MySQL databases are focused on consistency, as opposed to Cassandra (see the CAP theorem).
  2. Relational model theory is about normalization (1NF, 2NF, 3NF and above). NoSQL makes a difference here: as we want to get all the needed data in a single query, we allow ourselves to duplicate data. We are not structuring our data to be normalized; we structure it for the queries that will be executed.
  3. With a document store like Cassandra we have a flexible schema, so we may add and remove fields on the fly. This is a huge pro as our deployment grows (hundreds of nodes).
  4. Most setups with “normal” databases are master-based (with usually one “master” node) where write operations go; with Cassandra we have distributed writes, so we can write data anywhere.
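To make point 2 concrete, here is a hypothetical Python sketch of query-first modeling (the names and the twissandra-style timeline idea are illustrative, not Cassandra's API): instead of normalizing tweets and joining at read time, each follower's timeline row stores a duplicated copy, so one key lookup answers the query:

```python
# Normalized (relational) thinking: one tweets table, join on followers at read time.
# Query-first (Cassandra-style) thinking: duplicate the tweet into every follower's
# timeline row at write time, so reading a timeline is a single key lookup.

timelines = {}  # row key (username) -> ordered list of tweet copies

def post_tweet(author, body, followers):
    tweet = {"author": author, "body": body}
    for follower in followers:
        timelines.setdefault(follower, []).append(tweet)  # deliberate duplication

post_tweet("alice", "hello cassandra", followers=["bob", "carol"])

# One query by key per timeline - no joins needed.
print(timelines["bob"])  # [{'author': 'alice', 'body': 'hello cassandra'}]
```

The write does more work (one copy per follower), which is exactly the trade NoSQL modeling accepts in exchange for cheap reads.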

There are a lot of nice articles about installation of Cassandra, so I will just point them here:

As an OS X user I’ve used the last two, but the highlighted link is worth seeing, as it shows how to build a Cassandra cluster.

To play a little bit with Cassandra we will use the Cascal library, which is hosted on GitHub. Cascal has pretty good documentation, so if something is unclear refer to the Cascal wiki page. One additional important project is twissandra. It is an example project demonstrating how to use Cassandra, so to better understand the Cassandra data model it is good to get that project and play with it a little bit.

The practical part is outdated, as Cascal is outdated. A current list of drivers is on the cassandra planet page.


Cassandra is well known for having no single point of failure; it is the data storage for Facebook, Twitter and Digg. And, what is most important here, Cassandra now has commercial support, so if your business doesn’t have time to learn and play with Cassandra, you may now call Riptano. They provide services and training for Apache Cassandra.

Yes, Riptano is now called DataStax 🙂

Cassandra is having its five minutes of fame, and as we can see she proves it is worth putting some effort into learning NoSQL-style data models. As always, some problems are ideal for relational data storage and some are typical for Cassandra; it is good to have both tools in our toolbox.

In the meantime, while I was playing with Cassandra, the Cassandra team released versions 0.6 and 0.6.1. The most important feature is Hadoop MapReduce support; there are also performance improvements with a new caching layer. So as you see, they are moving fast :).

Yes, they went really fast. I played with 0.5.1 and now we have 2.1. I decided to publish this outdated post, but to make it interesting I’ve added a Cassandra Time Machine.


Cassandra Time Machine:

This time machine shows the most important changes in each major version of Cassandra. Besides these changes there were a lot of bug fixes and improvements as well. I’ve made this time machine to appreciate the work of Cassandra’s contributors. They did, and still do, a great job!

0.6.x (2010)

The Cassandra team resolved 348 issues (part of them ported from 0.7.x) across thirteen releases. From version 0.6.7 onward, all releases were bug fixes ported from 0.7.x.

Features added:

  • Simple and very “stupid :)” Hadoop integration,
  • Dynamic endpoint snitch – “an endpoint snitch that automatically and dynamically infers ‘distance’ to other machines without having to explicitly configure rack and datacenter positions”,
  • JMX is accessible for non-Java clients,
  • Authorization and authentication (the beginning)
  • Per-keyspace replication factor (the beginning of replication strategy)
  • Row level cache
  • InProcessCassandraServer for testing purposes. It was later replaced by EmbeddedCassandraService.
  • and many more minor features (ConsistencyLevel.ANY, ClusterProbe, Pretty-print column names, more JMX operations, global_snapshot and clear_global_snapshot commands, cleanup utility …)

0.7.x (2011)

This time they resolved 1006 issues across ten releases.

Features added:

  • Expiration time for columns. An expired column acts as ‘markedForDelete’.
  • Configurable ‘merge factor’ for Column Families. MergeFactor attribute is used to tune read vs write performance for a ColumnFamily. A lower MergeFactor will cause compaction more frequently, leading to improved read performance at the cost of decreased write performance.
  • Allow creating indexes on existing data.
  • EC2Snitch – this snitch assumes that an EC2 region is a DC and an availability zone is a rack.
  • scrub command – rebuilds sstables for one or more column families.
  • Removal operation that operates on key ranges and deletes an entire column family (truncate operation).
  • Weighted request scheduler.
  • and many more (access levels for Thrift, many cassandra-cli improvements, NumericType column comparator, support for Hadoop Streaming, cfhistograms, secondary indices for column families, a per-node JMX interface, …)
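The column-expiration feature above can be mimicked in a few lines of Python (a toy sketch of the semantics, not Cassandra's implementation): a column carries a TTL, and once it expires, reads treat it as deleted:

```python
import time

def make_column(name, value, ttl_seconds=None):
    return {"name": name, "value": value,
            "created": time.time(), "ttl": ttl_seconds}

def read_column(col, now=None):
    """Return the value, or None if expired (the column acts as markedForDelete)."""
    now = time.time() if now is None else now
    if col["ttl"] is not None and now - col["created"] > col["ttl"]:
        return None  # tombstone: expired columns behave as deleted
    return col["value"]

col = make_column("session", "abc123", ttl_seconds=60)
print(read_column(col, now=col["created"] + 30))   # abc123 (still live)
print(read_column(col, now=col["created"] + 120))  # None (expired)
```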

0.8.x (2011)

The last version before 1.0. The team resolved 549 issues and released ten versions.

Features added:

  • CQL (Cassandra Query Language) 1.0 language specification.
  • The idea of Coprocessors (from a hackathon), which was renamed to Plugins and later implemented in 2.x as Triggers.
  • SeedProvider is pluggable via interface
  • Encryption support for internode communication (all, none).
  • EC2 features for setting seeds and tokens (in EC2, machines die and come up more frequently).
  • Compaction throttling.
  • Support for batch insert/delete in CQL.
  • JDBC driver for CQL.
  • and many more (more commands in cli, different timeouts for different classes of operation, counter column support for SSTableExport, EndpointSnitchInfoMBean …)

1.0.x (2011)

The first stable release. 510 issues were resolved across twelve versions.

Features added:

  • SSTable compression – a long-awaited feature (CASSANDRA-47). Most of the time it is a good trade to exchange CPU for I/O.
  • Stream compression – Today we have Snappy, LZ4 and Deflate compression.
  • Checksum for compressed data to detect corrupted columns.
  • Better performance for rows that contain many (more than a thousand) columns.
  • The max client timestamp for an SSTable is captured and exposed via SSTableMetadata.
  • Encryption for data across DC only.
  • Timing information for cassandra-cli queries – it looks like a cosmetic change, but is very handy.
  • Redesigned Compaction
  • CQL 1.1
  • and many more (RoundRobinScheduler metrics, overriding RING_DELAY, upgradesstables nodetool command, CQL improvements,  bloomfilter stats and memory size, ….)

1.1.x (2012)

The one-dot-one line had twelve releases and the team resolved 620 issues.

Features added:

  • Concurrent Schema Migrations,
  • Prepared statements,
  • Infinite bootstrap – for testing a new configuration with live traffic. In this mode the node follows the bootstrap procedure as normal, but never fully joins the ring.
  • Running Map/Reduce job with server side filtering.
  • Override of the available-processors value, so we can deploy multiple instances on a single machine.
  • CompositeType comparator is now extendable.
  • Fine-grained control over data directories, so we can control which sstables are placed where.
  • Eagerly re-write data at read time.
  • Configurable transport in RecordReader and RecordWriter.
  • and many more (ALTER of Column Family attributes in CQL, Gossip goodbye command, loading from a flat file, COPY TO command, CQL support for describing keyspaces and column families, “rebuild index” JMX command, disable snapshots option, “ALTER KEYSPACE” statement …)

1.2.x (2013)

This time the Cassandra team resolved 997 issues and released nineteen versions.

Features added:

  • Disallow client-provided timestamps, so WITH TIMESTAMP was ripped out.
  • Query tracing details – very helpful feature.
  • Ability to query collection types (list, set, and map) in CQL.
  • CQL 3.0 (better support for wide rows and generalization for composite columns, per-column family default consistency level)
  • Murmur3 partitioner, which is faster than MD5.
  • Different timeout for reads and writes.
  • Atomic, eventually-consistent batches.
  • Compressed and off heap bloomfilters.
  • Global prepared statement instead of connection based.
  • Describe cluster for nodetool and cqlsh.
  • Metrics for native protocols and for global ColumnFamily.
  • Latency consistency analysis within nodetool. Users can accurately predict Cassandra’s behavior in their production environments without interfering with performance.
  • Custom CQL protocol and transport.
  • LZ4Compressor, with compression about two times faster than Snappy.
  • LOCAL_ONE consistency level.
  • and many more (improved authentication logs, Multiple independent Level Compactions, UpgradeSSTables optimization, tombstone statistics in cfstats, ReverseCompaction,  resizing of threadpools via JMX, allow disabling the SlabAllocator, Notify before deleting SSTable…)
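As a rough illustration of what a partitioner does: a row key is hashed to a token, and the token decides which node owns the row. This Python sketch uses MD5 (i.e. the older RandomPartitioner scheme, since Murmur3 is not in Python's standard library), and the node names and tokens are made up:

```python
import hashlib

RING_SIZE = 2 ** 127  # token range used by the MD5-based RandomPartitioner

def token_for(key: bytes) -> int:
    """Hash a row key to a token on the ring."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING_SIZE

def owner(key: bytes, node_tokens: dict) -> str:
    """Pick the node whose token is the first at or after the key's token,
    wrapping around - a simplified consistent-hashing walk."""
    t = token_for(key)
    candidates = sorted(node_tokens.items(), key=lambda kv: kv[1])
    for node, node_token in candidates:
        if t <= node_token:
            return node
    return candidates[0][0]  # wrap around the ring

nodes = {"node-a": RING_SIZE // 3, "node-b": 2 * RING_SIZE // 3, "node-c": RING_SIZE - 1}
print(owner(b"jsmith", nodes))  # deterministic: same key always maps to the same node
```

Swapping the hash function (MD5 vs Murmur3) changes speed and token distribution, not this overall shape.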

2.0.x (2013)

2.0 was released, and here you can find a great DataStax article: What’s under the hood in Cassandra 2.0. There were ten versions and the team resolved 868 issues.

Features added:

  • Triggers – asynchronous triggers are a basic mechanism to implement various use cases of asynchronous execution of application code at the database side.
  • Query paging mechanism in the native CQL protocol.
  • Compare and Set Support (SET with IF statements).
  • Streaming 2.0.
  • Multiple ports for gossip from a single IP address; this allows multiple Cassandra service instances to run on a single machine, or on a group of machines behind a NAT.
  • CQL improvements.
  • Reduce Bloom Filter Garbage Allocation.
  • Network topology snitch supporting preferred addresses, so having a cluster spanning multiple data centers, some in Amazon EC2 and some not, is possible.
  • and many more (index_interval configurable per column family, single-pass compaction, tracking sstable coldness, beforeChange notification, CqlRecordReader, balance utility for vnodes, triggers on LWT operations, …)

2.1.x (2014 during Cassandra Summit)

Two betas, six release candidates, 535 issues resolved. That’s great news. DataStax provided great articles about 2.1.

Final Word

Rafał provides great Cassandra Modeling Kata. It’s worth reading!

This post originally dates from April 2010. I’ve added citations to make some comments, and I’ve added the Time Machine section (after the original references). Currently the best places to start are the DataStax Blog, CassandraPlanet, and of course Twitter.

Some features were backported into previous versions; that’s the reason they appear under an earlier version (e.g. CASSANDRA-5935 was fixed in 2.0.1 but also ported to 1.2.10, so it shows up under 1.2.x).

In the meantime DataStax became a great company. They have been behind Cassandra for many years, providing stable and continuous growth. Currently their valuation is around 830 million US dollars. If you have spare money I recommend investing in this company :). DataStax’s last pre-IPO round raised $106 million.

CloudFront Joined AWS Free Usage Tier

Amazon CloudFront is a Content Delivery Network (CDN) and more, as it is integrated and optimized to work with other AWS services. By using edge locations in Amazon’s data centers we can cache our content and deliver it with low latency. It doesn’t matter if the content is static (S3 objects) or dynamic (an EC2 service). We can deliver an entire website, or cache part of it to improve performance (static content at edge locations means faster downloads of images, JS, etc.). We should consider it in our mobile apps, where latency is challenging. You will find detailed information on the product page.

So what’s the big deal with the AWS Free Usage Tier? The Free Usage Tier is how AWS promotes its services: we can use certain services for free. YES, for FREE, of course with some restrictions such as storage size, total requests, hours, etc. For us it means we can build a quite successful web page for free. On the other hand, we can just run some tests to find out if a particular solution works for us. It’s marketing for AWS, and an opportunity for us.

The Amazon CloudFront free tier allows for 2,000,000 HTTP and HTTPS requests each month for one year. For small businesses or testing purposes it’s enough.

If we can use something for free, it is good to mention it. In our profession we have to learn all the time, and free access gives us the opportunity to play a little bit before buying. For working professionals paying a few bucks is not a problem, but for students it might be.

Currently AWS Free Usage Tier has:

  • Amazon EC2 – so we have some servers for free
  • Amazon S3 – so we have some static object storage for free
  • Amazon RDS – so we have a database for free
  • Amazon CloudWatch – so we can monitor for free
  • Amazon EBS – so we have block storage for free
  • Amazon SES – so we can send email for free
  • Amazon SQS – so we have messaging for free
  • Amazon SNS – so we have notifications for free
  • Amazon SWF – so we have workflows for free
  • and few more …

and of course Amazon CloudFront for free, so we have a CDN.

In my opinion a lot of small businesses can start for free, and if somebody succeeds then… you know, FB or Google will pay a few millions and will provide the infrastructure 🙂


kdb+ high performance, column-oriented, designed for massive datasets database

Today data volumes are growing continuously, and we need to do complex processing in real time. kdb+ tries to solve that problem. It includes q, a vector-processing, SQL-like programming language.

Take for example the New York Stock Exchange, which today produces a billion records per day (at peaks over two billion), amounting to over 200 GB of data per day. kdb+ not only handles that, but additionally lets you analyze the data in real time.

I think it is worth checking out, so this post is my kdb+ evaluation. First of all, kdb+ is 64-bit, but unfortunately you can only evaluate the 32-bit version – for me it is enough, but for a large company it may be an issue. A lot of Is?, What?, Does? questions are answered in the FAQ.

With kdb+ you can use:

  • kdb+taq – to analyze the Trade and Quote (TAQ) database (unfortunately it isn’t free)
  • kdb+tick – to capture and analyze billions of real-time events (ticks); we can write new feed handlers (Java, C) or use built-in ones (Tibco, Reuters, Bloomberg). One diagram is worth a thousand words (PDF)

Getting started with kdb+ is instant.

  1. Download the 3.0 or 2.8 evaluation.
  2. Unpack it.
  3. Run the q command, and you get a REPL:
KDB+ 3.0 2012.12.20 Copyright (C) 1993-2012 Kx Systems
m32/ 8()core 8192MB sebastian.pietrowski c46024.local PLAY 2013.03.20

Next I went to the tutorials on the wiki page. There are three tutorials right now:

  • starting kdb+: The first few questions wanted to discourage me: “for serious evaluation of the product you need the help of a kdb+ consultant“, but I’m not a serious man so I can manage without a consultant :). A little bit later we get a nice tip that the q language has some visual tools. I’ve taken Studio for kdb+; it is cross-platform and there is a special build for MacOS users. There is also an Eclipse plugin; I’m not an Eclipse fan, but for Eclipse fans it should be perfect. For those who have never seen the q language and kdb+ it is a nice place to start. We can learn:
    • how to switch between k-mode and q-mode using \.
    • how to execute shell commands (\command_name).
    • learn about types, data structures, namespaces, error messages etc.
    • joins in q queries are much nicer – I would love to see this in the SQL standard.
    • a tcp/ip listener can be started with “q -p 5001” or “\p 5001“. The communication channel can be used synchronously or asynchronously. We can access and see all variables through a web browser at http://localhost:5001 (if we chose port 5001).
    • tables are created out of lists and live in memory; if needed they are written to disk, and for big ones kdb+ will do partitioning (more information in the 6th chapter of the third tutorial).
    • The ? verb generates random data – it’s very useful.
    • the historical database (hdb) holds data from before today – typically it is on disk.
    • the real-time database (rdb) stores today’s data in memory (it will be written to the hdb the next day) – it is recommended to have 4 times more RAM than the expected data size.
    • data feeds are time-series data. A feedhandler converts the data stream into the kdb+ database.
  • Q for mortals: this tutorial is about the Q language, so you can learn about the origins, author, philosophy, etc. of Q.
    • For sure it is a very good idea to bookmark it. It is worth distinguishing between 0w, 0W, 0n, 0Wh, etc. (positive float infinity, NaN or not-a-number, positive int infinity, etc.)
    • I think more languages should introduce rules such as “… so specifying more than eight arguments will cause an error” ;). A recommendation like this one is also awesome: “When a function exceeds 20 expressions, you should ask yourself if it can be factored.”
    • folding is great here we have projection and even multiple projection.
    • adverbs are pretty cool, aren’t they?
    • q-sql kills my brain – but for now it is OK, I’m not a Q consultant 🙂
    • there are a lot of license error messages; try to guess this one: ‘wha‘.
    • References should be bookmarked as you will need them almost all the time.
  • and kdb+ for mortals: the last tutorial describes internals; here we can find really interesting stuff – whether or not we end up using kdb+/q, it is worth reading.
    • Tables with many columns should be splayed – most queries refer to a subset of columns (and only those columns will be loaded).
    • Symbols vs. strings – the golden rule is: “fields that are invariant (e.g., keys), are drawn from a controlled domain, or will primarily need equality testing are excellent candidates for symbols – e.g. exchange symbols” and “Fields that need selection on content or are rarely referenced are good candidates for string fields—e.g., comments, audit notes“.
    • Symbols vs Strings (grey area) – “fields such as an alphanumeric order ID. Such fields are unique and invariant but the domain is not limited, as there will be a significant number of new variants each day. Further, the values may be recycled over time. The best advice is to consider your use cases carefully, paying attention to the likely long-term disposition of the ID values“.
    • For partitioned tables kdb+ will use a map-reduce algorithm for aggregate functions.
    • I/O bandwidth limits performance on large partitioned tables. It is possible to spread a table across multiple I/O channels to facilitate parallel retrieval.
    • Segmentation spreads a partitioned table’s records across multiple storage locations.
    • Two tables that share the same partitioning are not required to share the same segmentation decomposition.
    • In the end, if you are lost as I was, the tutorial comes with help 🙂
      whereami:{-1 "cd ~ ",system "cd";}

My first idea for checking kdb+ was to work on large Excel files: instead of working in Excel I can use q-sql.

cols: ("ISI";",") 0:`data.csv          /read excel as columns
tab: flip `col_a`col_b`col_c!cols      /create table from columns
...                                    /do magic with q-sql
save `tab.csv                          /save to file - save is smart so: no extension - binary file;
                                       / csv - ','; txt - tab;
                                       / xls - excel file; xml - xml file;

It’s simple and easy, and from my point of view SQL beats VB, but I will not pay for kdb+ just to improve Excel :).
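For comparison, the same read/transform/save round-trip sketched with Python's csv module; the column names and the "do magic" filter step are placeholders mirroring the q snippet above, and the CSV data is inlined so the sketch is self-contained:

```python
import csv
import io

# Read CSV into rows (mirrors: cols: ("ISI";",") 0:`data.csv)
raw = "col_a,col_b,col_c\n1,x,10\n2,y,20\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# "do magic" with a q-sql-like select: keep rows where col_c > 10
filtered = [r for r in rows if int(r["col_c"]) > 10]

# Save back to CSV (mirrors: save `tab.csv)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["col_a", "col_b", "col_c"])
writer.writeheader()
writer.writerows(filtered)
print(out.getvalue())
```

The point of the comparison: in q the whole pipeline is three short lines because tables and columns are first-class; in general-purpose languages you assemble the same steps by hand.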

The real use case is one like the NYSE: we have a flood of data coming in, and we want to process it and also do analysis in real time. Most solutions are able to process the data, but here we have it divided into two subsets:

  1. Live window of data (minutes, hours, days) – to process as fast as possible
  2. Historical data – to do analytic part

It would also be useful in all Hadoop scenarios (e.g. log processing) – but in real time.
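The two-tier split above can be sketched in Python (a toy model of the idea, not kdb+ internals): today's records live in an in-memory rdb, older partitions belong to the hdb, and at end of day the rdb contents are rolled into an hdb partition. The fixed "today" date is an assumption to keep the sketch deterministic:

```python
import datetime
from collections import defaultdict

TODAY = datetime.date(2013, 3, 20)  # fixed "today" for the sketch

rdb = []                 # real-time db: today's records, held in memory
hdb = defaultdict(list)  # historical db: date -> records (stand-in for disk)

def insert(record, date):
    if date == TODAY:
        rdb.append(record)   # hot path: fast in-memory append
    else:
        hdb[date].append(record)

def end_of_day_roll():
    """At day's end, today's in-memory data becomes an hdb partition."""
    hdb[TODAY].extend(rdb)
    rdb.clear()

insert({"sym": "IBM", "px": 100.0}, datetime.date(2013, 3, 20))
insert({"sym": "IBM", "px": 99.5}, datetime.date(2013, 3, 19))
print(len(rdb), sum(len(v) for v in hdb.values()))  # 1 1
```

Queries over the live window hit only the in-memory list, while analytics over history scan the date partitions, which is the essence of the rdb/hdb design described in the tutorial.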

I checked how it would perform on my Mac (2.2 GHz Core i7, 8 GB RAM) and got these results:

  • 1.126 million inserts per second (single insert)
  • 22.727 million inserts per second (bulk insert 10)
  • 125 million inserts per second (bulk insert 100)
  • 200 million inserts per second (bulk insert 1000)
  • 200 million inserts per second (bulk insert 10000)

Of course this is just a micro-benchmark.
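The single-vs-bulk curve is easy to reproduce in any language. Here is a hypothetical Python micro-benchmark of the same shape (the absolute numbers will differ wildly from kdb+; the point is only that per-call overhead shrinks as the batch grows):

```python
import time

def bench(batch_size, total=100_000):
    """Insert `total` rows in batches of `batch_size`; return rows/second."""
    store = []
    batch = [{"px": 1.0}] * batch_size
    t0 = time.perf_counter()
    for _ in range(total // batch_size):
        store.extend(batch)          # one "bulk insert" of batch_size rows
    elapsed = time.perf_counter() - t0
    return len(store) / elapsed

for size in (1, 10, 100, 1000):
    print(f"bulk {size:5d}: {bench(size):14.0f} inserts/sec")
```

Run it a few times and take the best figures; as in the kdb+ numbers above, throughput climbs steeply with batch size and then flattens out.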

Integration is also a good point of kdb+, as it integrates with major programming languages such as C/C++, Java, .NET, R, Matlab, Perl and Python.


If you need a single solution for real-time data with analytics, you should include kdb+ in your evaluation. kdb+ stores the database as ordinary native files, so it does not have any special needs regarding hardware or storage architecture. It is worth pointing out that since the database is just files, your administrative work won’t be difficult (standard tools).

The architecture of kdb+ was built to be extremely fast: 64 bits to handle enormous amounts of data, a columnar structure to simplify indexing and joining (and reduce the memory footprint), and date atoms as basic data types that make time-ordered analysis fast and efficient. Last but not least, multi-core and multi-threaded processing brings a huge benefit here.

And the benefit is that we can process millions of records per second.

Happy evaluation!


And one more thing: if you are searching for a solution, it is worth checking out the svn contrib repository and/or the Contrib wiki page as well.