NoSQL hype – Cassandra example

In the last few weeks there has been a lot of NoSQL hype, with more and more reports of companies migrating from relational databases such as MySQL to NoSQL solutions. There are a lot of pretty awesome NoSQL solutions on the market, but from my point of view the most promising is Cassandra.

Cassandra was originally developed at Facebook (btw, one of the core developers was hired from Amazon, where he was one of the authors of Amazon Dynamo). In 2008 Facebook open-sourced the Cassandra project, and now it is developed under Apache.

The Apache Cassandra project is based on two awesome papers, “Bigtable: A Distributed Storage System for Structured Data” (2006) and “Dynamo: Amazon’s Highly Available Key-value Store” (2007). The result is:

Fault tolerant
Data is replicated and nodes can be replaced with no downtime.
Scalable
Read and write throughput both increase linearly as new nodes are added.
Proven
Digg, Facebook, Twitter and more are great examples of usage.
Easy to use
High-level API for Java, Ruby, Python, Scala and more.

Most of the API is exposed through Thrift, which is also developed at the Apache Foundation but is still in the incubator. It is a framework for cross-language services development.

As Cassandra is a NoSQL store, it does not have typical relational tables; instead it uses data structures such as:

Column
Tuple of name, value and timestamp
SuperColumn
Name and a map of Columns
ColumnFamily
Is the infinite container for Columns
SuperColumnFamily
Is the infinite container for SuperColumns
Keyspace
Keyspace is the outermost grouping of your data.
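To make these structures a bit more tangible, here is a minimal in-memory sketch in Python. It is only a mental model, not how Cassandra stores data on disk and not any driver's API. All names (`Column`, `SuperColumn`, `insert`, the `Users` column family) are made up for illustration, and the "newer timestamp wins" rule in `insert` is my simplified reading of how conflicting writes are resolved.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    """A tuple of name, value and timestamp."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A name plus a map of Columns."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


# ColumnFamily: row key -> {column name -> Column}
ColumnFamily = Dict[str, Dict[str, Column]]
# SuperColumnFamily: row key -> {super column name -> SuperColumn}
SuperColumnFamily = Dict[str, Dict[str, SuperColumn]]
# Keyspace: column family name -> ColumnFamily
Keyspace = Dict[str, ColumnFamily]


def insert(cf: ColumnFamily, row_key: str, column: Column) -> None:
    """Store a column; on a name clash the newer timestamp wins (assumption)."""
    row = cf.setdefault(row_key, {})
    current = row.get(column.name)
    if current is None or column.timestamp >= current.timestamp:
        row[column.name] = column


# Hypothetical keyspace with a single "Users" column family.
keyspace: Keyspace = {"Users": {}}
insert(keyspace["Users"], "jsmith", Column("email", "john@example.com", 2))
# This older write arrives late and is ignored.
insert(keyspace["Users"], "jsmith", Column("email", "old@example.com", 1))
```

Reading a value is then just a chain of key lookups: keyspace, column family, row key, column name.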

Very nice introduction article for Cassandra Data Model.

The main problem with NoSQL databases is that data modeling is completely different from relational data modeling. Because we ask for structured data by a given key, we have to store the data in a way that matches our queries to get the best performance.

First of all, NoSQL databases are not always better, so as always use the right tool to get the job done without pain. We should first decide whether we need a NoSQL database such as Cassandra at all. So … why might we want to use a NoSQL solution?

  1. No single point of failure – the relational model is hard and expensive to cluster (sequences, cascades, transactions, etc.). Oracle or MySQL databases focus on consistency, unlike Cassandra (see the CAP theorem).
  2. Relational model theory is about normalization (1NF, 2NF, 3NF and above). NoSQL makes a difference here: because we want to get all the needed data in a single query, we allow data to be duplicated. We do not structure our data to be normalized; we structure it for the queries that are executed.
  3. With a column-oriented store like Cassandra we have a flexible schema, so we can add and remove fields on the fly. This is a huge advantage as our deployment grows (hundreds of nodes).
  4. Most setups with “normal” databases are master-based (usually with a single “master” node) where all write operations go. With Cassandra writes are distributed, so we can write data to any node.
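Point 2 is easier to see with a small example. Below is a rough Python sketch of "structure your data for the queries", using plain dictionaries instead of a real cluster. The `tweets` and `userline` names and the `post_tweet` / `read_userline` helpers are hypothetical, chosen only for this illustration.

```python
# Two "column families" kept as plain dicts (illustration only).
tweets = {}    # tweet_id -> tweet data, for "show me this tweet"
userline = {}  # username -> {timestamp: copy of the tweet}, for "show me this user's tweets"


def post_tweet(tweet_id, username, body, timestamp):
    """Write the same tweet twice, once per query we want to serve."""
    tweet = {"username": username, "body": body}
    tweets[tweet_id] = tweet
    # Deliberate duplication: the user's timeline holds a full copy,
    # so reading it never needs a join back to `tweets`.
    userline.setdefault(username, {})[timestamp] = dict(tweet, tweet_id=tweet_id)


def read_userline(username, limit=10):
    """Serve the timeline with one key lookup, newest tweet first."""
    row = userline.get(username, {})
    return [row[ts] for ts in sorted(row, reverse=True)[:limit]]


post_tweet("t1", "alice", "hello", timestamp=100)
post_tweet("t2", "alice", "hello again", timestamp=200)
timeline = read_userline("alice")
```

The price is a second write per tweet; the gain is that each read is a single lookup by key, which is the trade NoSQL modeling usually makes.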

There are a lot of nice articles about installation of Cassandra, so I will just point them here:

As an OS X user I’ve used the last two, but the highlighted link is worth a look, as it shows how to build a Cassandra cluster.

To play with Cassandra a little we will use the Cascal library, which is hosted on GitHub. Cascal has pretty good documentation, so if something is unclear refer to the Cascal wiki page. One more important project is twissandra. It is an example project that demonstrates how to use Cassandra, so to better understand the Cassandra data model it is good to get that project and play with it a little.

The practical part is outdated, and so is Cascal. A current list of drivers is on the Cassandra Planet page.

Summary

Cassandra is well known for having no single point of failure; it is a data store for Facebook, Twitter and Digg. And, most importantly, Cassandra now has commercial support, so if your business doesn’t have time to learn and play with Cassandra, you can call Riptano. They provide services and training for Apache Cassandra.

Yes, Riptano is now called DataStax 🙂

Cassandra is having her five minutes now, and as we can see she proves that it is worth putting some effort into learning NoSQL-style data models. As always, some problems are ideal for relational data storage and some are typical for Cassandra, so it is good to have both tools in our toolbox.

While I was playing with Cassandra, the Cassandra team released versions 0.6 and 0.6.1. The most important feature is Hadoop MapReduce support; there are also performance improvements thanks to a new caching layer. So as you can see they are moving fast :).

Yes, they went really fast. I played with 0.5.1 and now we have 2.1. I decided to publish this outdated post anyway, but to make it interesting I’ve added a Cassandra Time Machine.

References:

Cassandra Time Machine:

This time machine shows the most important changes in each major version of Cassandra. Besides these changes there were a lot of bug fixes and improvements as well. I’ve made this time machine to appreciate the work of Cassandra’s contributors. They did and still do a great job!

0.6.x (2010)

The Cassandra team resolved 348 issues (part of them were ported from 0.7.x) and there were thirteen releases. From version 0.6.7 on, all releases were bug fixes ported from 0.7.x.

Features added:

  • Simple and very “stupid :)” Hadoop integration,
  • Dynamic endpoint snitch – “An endpoint snitch that automatically and dynamically infers “distance” to other machines without having to explicitly configure rack and datacenter positions solves two problems:”,
  • JMX is accessible for non-Java clients
  • Authorization and authentication (the beginning)
  • Per-keyspace replication factor (the beginning of replication strategy)
  • Row-level cache
  • InProcessCassandraServer for testing purposes. Now it is replaced by EmbeddedCassandraService.
  • and many more minor features (ConsistencyLevel.ANY, ClusterProbe, pretty-printed column names, more JMX operations, global_snapshot and clear_global_snapshot commands, cleanup utility …)

0.7.x (2011)

This time they resolved 1006 issues and there were ten releases.

Features added:

  • Expiration time for columns. An expired column acts as ‘markedForDelete’.
  • Configurable ‘merge factor’ for Column Families. The MergeFactor attribute is used to tune read vs write performance for a ColumnFamily. A lower MergeFactor will cause compaction more frequently, leading to improved read performance at the cost of decreased write performance.
  • Allow creating indexes on existing data.
  • EC2Snitch – this snitch assumes that an EC2 region is a DC and an availability zone is a rack.
  • scrub command – rebuilds sstables for one or more column families.
  • Removal operation which operates on key ranges and deletes an entire column family (truncate operation).
  • Weighted request scheduler.
  • and many more (access level for Thrift, many cassandra-cli improvements, NumericType column comparator, support for Hadoop Streaming, cfhistograms, secondary indices for column families, JMX per-node interface, …)
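The column expiration item from the list above is simple enough to sketch. The snippet below is only my illustration of the idea in Python: a column carries an optional time-to-live, and once it has passed, reads treat the column as marked for delete. The class and function names are hypothetical and are not Cassandra internals.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExpiringColumn:
    name: str
    value: str
    timestamp: int             # write time, in seconds
    ttl: Optional[int] = None  # seconds to live; None means "never expires"

    def is_marked_for_delete(self, now: int) -> bool:
        """An expired column behaves like a deleted one."""
        return self.ttl is not None and now >= self.timestamp + self.ttl


def read_row(row: Dict[str, ExpiringColumn], now: int) -> Dict[str, str]:
    """Return only the columns that are still alive at time `now`."""
    return {
        name: col.value
        for name, col in row.items()
        if not col.is_marked_for_delete(now)
    }


row = {
    "session": ExpiringColumn("session", "abc123", timestamp=1000, ttl=60),
    "email": ExpiringColumn("email", "john@example.com", timestamp=1000),
}
```

With this row, a read at second 1030 still sees the `session` column, while a read at second 1060 or later only sees `email`.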

0.8.x (2011)

The last version line before 1.0. The team resolved 549 issues and released ten versions.

Features added:

  • CQL (Cassandra Query Language) 1.0 language specification.
  • The idea of Coprocessors (from a hackathon), which was renamed to Plugins and implemented in 2.x as Triggers.
  • SeedProvider is pluggable via an interface.
  • Encryption support for internode communication (all, none).
  • EC2 features for setting seeds and tokens (in EC2, machines die and are brought up more frequently).
  • Compaction throttling.
  • Support for batch insert/delete in CQL.
  • JDBC driver for CQL.
  • and many more (more commands in the cli, different timeouts for different classes of operation, counter column support for SSTableExport, EndpointSnitchInfoMBean …)

1.0.x (2011)

The first stable release. 510 issues were resolved and there were twelve versions.

Features added:

  • SSTable compression – a long-awaited feature (CASSANDRA-47). Most of the time it is good to trade CPU for I/O.
  • Stream compression – today we have Snappy, LZ4 and Deflate compression.
  • Checksum for compressed data to detect corrupted columns.
  • Better performance for rows which contain many (more than a thousand) columns.
  • Max client timestamp for an SSTable is captured and provided via SSTableMetadata.
  • Encryption for data across DCs only.
  • Timing information for cassandra-cli queries – it looks cosmetic, but is very handy.
  • Redesigned compaction.
  • CQL 1.1
  • and many more (RoundRobinScheduler metrics, overriding RING_DELAY, upgradesstables nodetool command, CQL improvements, bloom filter stats and memory size, …)

1.1.x (2012)

The 1.1 line had twelve releases and the team resolved 620 issues.

Features added:

  • Concurrent Schema Migrations,
  • Prepared statements,
  • Infinite bootstrap – for testing a new configuration with live traffic. In this mode a node follows the bootstrap procedure as normal, but never fully joins the ring.
  • Running Map/Reduce jobs with server-side filtering.
  • Override of the available processors value, so we can deploy multiple instances on a single machine.
  • CompositeType comparator is now extendable.
  • Fine-grained control over data directories, so we can control which sstables are placed where.
  • Eagerly re-write data at read time.
  • Configurable transport in RecordReader and RecordWriter.
  • and many more (ALTER of Column Family attributes in CQL, Gossip goodbye command, loading from a flat file, COPY TO command, CQL support for describing keyspaces and column families, “rebuild index” JMX command, disable snapshots option, “ALTER KEYSPACE” statement …)

1.2.x (2013)

This time the Cassandra team resolved 997 issues and released nineteen versions.

Features added:

  • Disallow client-provided timestamps, so WITH TIMESTAMP was ripped out.
  • Query tracing details – a very helpful feature.
  • Ability to query collection types (list, set, and map) in CQL.
  • CQL 3.0 (better support for wide rows and generalization for composite columns, per-column family default consistency level).
  • Murmur3 partitioner, which is faster than MD5.
  • Different timeouts for reads and writes.
  • Atomic, eventually-consistent batches.
  • Compressed and off-heap bloom filters.
  • Global prepared statements instead of connection-based ones.
  • Describe cluster for nodetool and cqlsh.
  • Metrics for native protocols and for global ColumnFamily.
  • Latency consistency analysis within nodetool. Users can accurately predict Cassandra’s behavior in their production environments without interfering with performance.
  • Custom CQL protocol and transport.
  • LZ4Compressor – compression two times faster than Snappy.
  • LOCAL_ONE consistency level.
  • and many more (improved authentication logs, multiple independent Level Compactions, UpgradeSSTables optimization, tombstone statistics in cfstats, ReverseCompaction, resizing of thread pools via JMX, allow disabling the SlabAllocator, notify before deleting SSTable …)

2.0.x (2013)

Version 2.0 was released, and there is a great DataStax article about it: What’s under the hood in Cassandra 2.0. There were ten versions and the team resolved 868 issues.

Features added:

  • Triggers – asynchronous triggers are a basic mechanism for implementing various use cases of asynchronous execution of application code on the database side.
  • Query paging mechanism in the native CQL protocol.
  • Compare and Set support (SET with IF statements).
  • Streaming 2.0.
  • Multiple ports to gossip from a single IP address; this allows multiple Cassandra service instances to run on a single machine, or on a group of machines behind a NAT.
  • CQL improvements.
  • Reduced bloom filter garbage allocation.
  • Network topology snitch supporting preferred addresses, so a cluster spanning multiple data centers, some in Amazon EC2 and some not, is possible.
  • and many more (index_interval configurable per column family, single-pass compaction, tracking sstable coldness, beforeChange notification, CqlRecordReader, balance utility for vnodes, triggers LWT operations …)
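The Compare and Set item from the list above deserves a tiny illustration. The Python sketch below shows only the semantics of "set this column if it still has the value I expect", inside a single process and guarded by a lock. In Cassandra itself the same promise has to hold across replicas, which is a much harder problem; the `Row` class and `set_if` method are hypothetical names, not part of any driver.

```python
import threading


class Row:
    """A single-process stand-in for a row with conditional updates."""

    def __init__(self, **columns):
        self._columns = dict(columns)
        self._lock = threading.Lock()

    def get(self, column):
        return self._columns.get(column)

    def set_if(self, column, expected, new_value):
        """Set `column` to `new_value` only if it currently equals `expected`.

        Returns True when the update was applied, False otherwise.
        """
        with self._lock:  # check and write happen as one atomic step
            if self._columns.get(column) != expected:
                return False
            self._columns[column] = new_value
            return True


account = Row(owner="alice")
first = account.set_if("owner", expected="alice", new_value="bob")
# A second writer still believes the owner is "alice" and loses the race.
second = account.set_if("owner", expected="alice", new_value="carol")
```

Here `first` is applied and `second` is rejected, so the row ends up owned by `bob` instead of silently flipping to `carol`.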

2.1.x (2014 during Cassandra Summit)

Two betas, six release candidates, 535 issues resolved. That’s great news. DataStax provided great articles about 2.1:

Final Word

Rafał provides a great Cassandra Modeling Kata. It’s worth reading!

This post is originally from April 2010. I’ve added quoted blocks to make some comments, and I’ve added the Time Machine section (after the original references). Currently the best places to start are the DataStax Blog, CassandraPlanet, and of course Twitter.

Some of the features were backported into an earlier version, which is why they are listed under that earlier version (e.g. CASSANDRA-5935 was fixed in 2.0.1 but also ported to 1.2.10, so it shows up in 1.2.x).

DataStax has in the meantime become a great company. They have been behind Cassandra for many years, providing stable and continuous growth. The current valuation is around 830 million US dollars. If you have spare money I recommend investing in this company :). The last pre-IPO round raised $106 million for DataStax.

Model Driven Architecture live or dead?

In my Wunderlist “watch this” list I found this dinosaur of a video to watch: MDA: A forlorn hope, by Uncle Bob. It was posted over two years ago and has been viewed almost 55 thousand times.

MDA is easy, isn’t it? The MDA Guide has only 62 pages. We need a modeling tool and another tool for model-to-code transformation. Sounds easy 🙂

A few years ago we used the MDA approach in our project. We used MagicDraw for the UML part and AndroMDA for UML-to-Java code generation (of course there are many other tools). From my point of view it was a great experience. I share my opinion below:

  • (pros) Model and Factory for free.
  • (pros) Hibernate mapping for free.
  • (pros) Documentation is up to date – you have to modify it to generate changes.
  • (pros) We focused on design before coding.
  • (pros) All your Hibernate mappings/DAOs/etc. are similar (standardized – we can modify the template).
  • (pros/cons) We can/have to change templates to align them with our standards.
  • (cons) A lot of code you would prefer never to read :/
  • (cons) Every time you change something, you have to regenerate the code.
  • (cons) Your codebase is divided into “read only” and “change here”.
  • (cons) If you forget to put additional metadata into the model, you’ll be doomed in the future.
  • (cons) Once you change a template, it is harder to upgrade the tools, and we have to regenerate the code.

Nowadays I compare that experience to modern frameworks such as Rails or Django. I have to add a comment here: what I really mean is that with the MDA approach I did not think about database mapping or DAO/repository objects. I started at the service level, and I have similar feelings when I’m using Rails or Django. I’m focusing on business logic, not on how to get data from or save it into persistent storage.

Uncle Bob talks about analysts as software creators, and this idea, fortunately for my salary, is impossible ;).

I totally agree that making software happens at a much more detailed level than model thinking, but … I think it wasn’t so bad to think about design at a high level, then generate code and go deep into the business logic details. How many times do Java programmers do the same job: create a POJO, annotate it or write an XML descriptor, create a DAO which looks more or less the same as the previous one, etc.

Let me cite some smart guys from Uncle Bob’s post:

  • Uncle Bob: “Programmers are details managers – sorry MDA” 🙂
  • Comment: “MDA is actually based on two grand ideas:
    – raising the level of abstraction above programming language.
    – satisfying everyone with universal set of *standard* abstractions. Most of MDA failures are due to the second idea, which is why MDA (not MDE in general) may be indeed a forlorn hope.”

I don’t want to force you to use MDA, just think about it and in …meantime … find the difference :).

Old times
Blog post before 1st May
nowadays
Blog post after 1st May

Yep, posterous is dead. Is MDA dead? Please share your opinion in comments.

 

Monitoring and metrics – Yes, of course.

I always try to convince everybody to measure. The first reactions vary a lot, but after a while everyone comes back and says “Wow! I didn’t realize how helpful metrics can be”.

I’ve watched the video “Using Monitoring and Metrics to Learn in Development” (by Patrick Debois from Atlassian) with pleasure. I use Atlassian tools and have talked with a few of their guys. They really care about the code of their products. Patrick not only talks about metrics but also about techniques and ideas which help deliver better software.

Ideas from presentation:

  • smaller and more frequent changes – easier to repair (dev for ops)
  • faster and better feedback – easier to find problems (ops for dev)
  • continuous integration maturity model – see the slideshare presentation.
  • reuse “workflows” across environments by using virtualization – vagrant (a great tool for building dev environments), puppet/chef (configuration automation and management)
  • the infrastructure code repository and the application code repository have to be in sync.
  • always remember that: “a lot of different monitoring levels we have” :).
  • monitoring driven development 🙂 – create a monitor check before implementing a feature (it is useless but you can think about similarities)
  • monitoring tools grumble (around 00:19:20) – there are projects by “Monitoring Sucks” (github); you can check them, but only one seems to be alive.
  • use monitoring as a service – this sounds reasonable (Pingdom, New Relic, Boundary, Librato, and some others).
  • and a lot of other tools are mentioned – if you want to evaluate tools for metrics and monitoring you should watch the video and note potential tools for evaluation.
  • always know the context of the metrics.
  • final thought: Metrics Driven Engineering (Etsy on the stage) – IMHO this is a great idea to follow.
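The “monitoring driven development” idea from the list above can be tried in a few lines of Python. This is my own sketch, not something shown in the talk: the metric name `signups_per_hour`, the threshold and the status strings are all assumptions. The point is only that the check exists, and fails, before the feature that feeds it is written.

```python
def check_signups_per_hour(metrics, minimum=1):
    """Return a (status, message) pair for a hypothetical signup metric."""
    value = metrics.get("signups_per_hour")
    if value is None:
        # The feature does not report anything yet: the check goes "red" first.
        return ("UNKNOWN", "signups_per_hour is not reported yet")
    if value < minimum:
        return ("CRITICAL", f"only {value} signups in the last hour")
    return ("OK", f"{value} signups in the last hour")


# Before the feature exists there is no metric at all.
before = check_signups_per_hour({})
# After the feature ships and starts emitting the metric.
after = check_signups_per_hour({"signups_per_hour": 12})
```

The similarity to test-first development is that `before` reports a problem until the feature is really there, and `after` turns green without anyone touching the check.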

Whatever we do, at some point our code will not work, or it will stink and get worse, but every time we find out we can do something about it. This is the reason it is so important to monitor our applications continuously.

Patrick also mentions sharing responsibility between developers and operations; he wrote a nice blog post: Just Enough Developed Infrastructure. I recommend you read it too.