A Grail Quest: Where Did Our Productivity Go?

I recently talked with a friend who was worried that his team was not delivering as fast as he expected. He used to be a programmer; today he is a manager, and he remembers that "back then" he delivered much faster than his team delivers now.

He is confused because, after all, the developer's world keeps getting easier… He works at a company that invests time and attention in making programmers' lives better: continuous deployment on the right platform, automatic monitoring and much more.

In addition, the open source movement brings value; most of the time we just pick a solution that fits our needs. The OSS ecosystem is huge today, to name just a few:

Why is this happening? Why is my friend's team's productivity suffering?

The problem is not trivial; it seems that today's software engineer has everything. In addition to open source libraries, there are tools to deliver the product quickly and with excellent quality: an extensive development environment (IDE) with support for automated refactoring, which measures code quality and prompts us as we type.

All of this comes with tools for continuous integration and code deployment, plus support for multiple cloud environments. Developers no longer wait weeks for a new server or a software installation.

Looking at the portfolios of cloud providers, we can agree that developers have everything they need to get the job done: machines (IaaS), storage, databases of various types, queuing systems, even ready-made algorithms and specialized services (such as speech or image recognition), all the way to solutions for processing large data sets. Go and check any of the cloud providers:

What else do we need?

To dig deeper into the problem, let me cite two other friends who expressed doubts about "fast delivery of value".

The first one works after hours on various contracted products. Talking recently, he said something that made me think: in one week of after-hours work he delivers more value than his whole team together. The team is about 7 developers, and we can assume that "after hours" means about 50% of standard working time. That makes him 14 times more productive. Puzzling!

During another conversation, a colleague said: "You know Pedro, this is a few hours of work; unfortunately, once we pass it through Scrum, it will take us two weeks." Hmm… again, assuming that a few hours is less than a day's work, and two weeks is 10 working days, we are at least 10 times less productive.

Scary! What is going wrong? What is the problem?

While the first colleague's statement does not give us hard evidence, the second clearly points at the Scrum methodology: certain ceremonies or procedures cause the "same work" to take longer.

Scrum is a great framework for organizing and building software products; however, I have seen many times how things go wrong. What does that look like? I would point out a number of distortions which turn a reasonable set of rules (e.g. Scrum) into "the LAWS of success, which you must OBLIGATORILY follow, otherwise you and your project are DOOMED".
Characteristic of these laws:

  • If you follow these laws, your productivity will increase (hmm… as if daily meetings solved all problems).
  • If you do not follow a "law", you should be punished (a cap or t-shirt that reads "I was late for the daily", and so on).
  • If you follow the law, you should be proud of it – you are (surely :/) doing a good job (never mind that the implementation slipped another week – what matters is that you attended the planning).
  • Did not succeed? Still not productive? It means you do not follow the law, or you follow it the wrong way (the daily was at a bad time… the daily was not in the morning…, you did not do grooming…, you did grooming too late/too early… etc.).

The side effect is that since we have "a set of laws" and have to follow them, we measure our success by how closely we correlate with the laws we introduced. The "Institution of Measurement" is formed. After a while we start to measure our busyness – no matter whether what we do makes sense; what is important is that we follow the law, and all the indicators show that we are doing it wonderfully! 😉

“If We Measure Busyness, We’ll Create More Busyness.”

Applying these rules often leads to running around with the proverbial "empty wheelbarrows", where the running itself is the point. In this way we "accidentally" reach the essence: utilization of the resource (the human one).

The Resource Efficiency principle: if you cannot finish the job, switch to the next task. Do so until you find yourself spending all your time switching from task to task and/or waiting for others to finish the tasks you depend on.

From the business owner's point of view, the best utilization is a person working 100% of the time, without even a coffee break. I once heard from the C-level: "You have to tell your people to work faster." In my experience such a demand is ridiculous: there is usually something in the system that slows the work down, and "work faster, do more, get more people" changes nothing (many times things get even worse).

A pretty good analogy is the highway. If there are roadworks or a narrowing of the motorway, adding more cars only reduces the number of cars able to pass through (work done), and increasing speed raises the probability of an accident and a complete blockage of the motorway. The more cars, the slower the traffic, until finally a total jam (of business projects waiting to be realized).

Paradoxically, fewer cars and lower speed increase the throughput (work done). Note that an idle worker and a fully jammed system deliver the same velocity (close to zero), with completely different utilization (close to 0% versus close to 100%). The toy simulation below illustrates this.
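To make the analogy concrete, here is a toy queueing simulation – a hedged sketch with made-up numbers, not a measurement. One worker processes tasks that arrive at random; the only knob is utilization. Watch how cycle time explodes as utilization approaches 100%:

```python
import random

def average_cycle_time(utilization, n_tasks=100_000, service_rate=1.0):
    """Single-worker queue: mean time from task arrival to completion."""
    arrival_rate = utilization * service_rate
    clock = 0.0     # arrival time of the current task
    free_at = 0.0   # moment the worker becomes free again
    total = 0.0
    for _ in range(n_tasks):
        clock += random.expovariate(arrival_rate)           # next task arrives
        start = max(clock, free_at)                         # waits if worker is busy
        free_at = start + random.expovariate(service_rate)  # service time
        total += free_at - clock                            # waiting + service
    return total / n_tasks

for u in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {u:.0%} -> avg cycle time {average_cycle_time(u):.1f}x the work itself")
```

At 50% utilization a task takes roughly twice its raw working time; at 99% it takes roughly a hundred times longer. That is the jammed motorway.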

The first part ends here.

Next time you want to add more people to move faster – stop – and think about the bottlenecks in your environment. It might be one huge codebase where several teams developing different features get in each other's way, a lack of knowledge, or a lack of automation in a microservices world. There is no one solution to fit all your needs…

By the way, as Tech Rebels we help companies get through technical hurdles. Just hire us, and we will bring your organization to the next technical level.

Book review: The Business Value of Web Performance

Web performance has always been very important to me. Browsers need to download more and more resources to render a full page. On one hand we want our pages to look gorgeous; on the other hand this means more images, more bandwidth and, of course, more time.

In today's mobile world this becomes even more important: we are in a hurry, and if a page doesn't show up in 1-2 seconds we swipe away. I recently read "Time Is Money" by Tammy Everts. It is a very short book (about 100 pages); you can read it in one day. I recommend you get it and read it.

In the first two chapters, the author presents many case studies and the scientific background to convince us that performance matters. Fast websites create happy users – many case studies show that we perceive a slow webpage as even slower than it really is. Research shows two really important aspects of our brains:

  • Short-term memory – it is very limited and decays quickly.
  • The need to feel in control – if we have to wait, we feel powerless and frustration sets in.

That means a 0.1-second response gives us the illusion of instantaneousness, while after 10 seconds we lose focus. We need fast applications to reach "flow"; otherwise we get tired.

100 ms is Google's stated goal for page load times

The book offers plenty of examples of how a few seconds can change a site's conversion rate. I think this quote gets to the point:

“First and foremost, we believe that speed is more than a feature. Speed is the most important feature. If your application is slow, people won’t use it.”

– Fred Wilson, VC, Union Square Ventures.

Digging deeper, we find that the problem is latency and bandwidth. An important thing to know: increasing bandwidth by 1000% improves page load time by only about 50%. We should also remember that browsers limit the number of simultaneous connections, and those limits vary by version and OS. That means latency is what we should really care about. The big problem with latency is that it is unpredictable and inconsistent; it can be affected by almost anything, from the weather 🙂 to what your neighbours are downloading.
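To see this for yourself, here is a rough sketch using only the Python standard library (example.com is a placeholder host): it times the TCP handshake, which costs roughly one round trip and is dominated by latency, separately from the full download, which also depends on bandwidth.

```python
import socket
import time
import urllib.request

host, url = "example.com", "https://example.com/"

t0 = time.perf_counter()
socket.create_connection((host, 443), timeout=5).close()  # TCP connect ~= 1 RTT
connect = time.perf_counter() - t0

t0 = time.perf_counter()
body = urllib.request.urlopen(url, timeout=10).read()     # full request + body
total = time.perf_counter() - t0

print(f"TCP connect (latency proxy): {connect * 1000:.0f} ms")
print(f"Full download of {len(body)} bytes: {total * 1000:.0f} ms")
```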

Regardless of how much money we invest in building out infrastructure, latency will remain one of the greatest obstacles to web performance.

Next we move on to the optimisation layers, where we can measure and optimise response times. These layers are:

  • Servers
  • Load Balancers
  • Content Delivery Networks (CDNs)
  • Frontend Optimization (FEO)

These layers serve different optimization purposes, but the rule of thumb says that 80-90% of end-user response time is spent in the frontend, so it is a good idea to start there. Mobile optimisation is the VIP here, so we should always figure out how to decrease the request count and minimise response sizes, as the sketch below shows.
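As a starting point, here is a hedged sketch of such an audit, standard library only: fetch a page and count the subresources a browser would have to request. A real audit would also follow CSS imports, fonts and third-party tags.

```python
from html.parser import HTMLParser
import urllib.request

class ResourceCounter(HTMLParser):
    """Collect the src/href of images, scripts and stylesheets on a page."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

# example.com is a placeholder; point this at your own page.
html = urllib.request.urlopen("https://example.com/").read().decode("utf-8", "replace")
counter = ResourceCounter()
counter.feed(html)
print(f"{len(counter.resources)} subresources referenced:")
for resource in counter.resources:
    print(" ", resource)
```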

To know where to put our effort – which page or part of the application to optimise – we have to measure, and we need to know which pages matter most from the customer-conversion point of view. Many measurement tools are available, mainly of two types: Application Performance Management (APM) and Digital Performance Management (DPM).

It might turn out that the cart page is less important than the welcome page, or vice versa. We should figure out what impact web performance has on Customer Lifetime Value (CLV).

A very important question is "How fast is fast enough?", and let me cite the author:

“optimising performance is like painting the Golden Gate Bridge: it never ends.”

Let me summarize: performance is very important. Caring about performance gives us better conversion rates and happier customers. It is essential to build a culture of performance in your company. To do this, show people case studies that demonstrate the value of performance – and the best case studies are your own, building broad knowledge on both the business and the technical side.

Finally, everyone who touches a web page – from people on the business side looking to add third-party analytics tags, to people on the marketing team wanting to add high-resolution hero images – needs to know that their decisions affect performance. And impact on performance is impact on revenue: we can increase it or decrease it.

And one final quote:

Remember that performance is a journey, not a destination.

NoSQL hype – Cassandra example

In the last few weeks there has been a lot of NoSQL hype, with more and more news about companies migrating from relational databases such as MySQL to NoSQL solutions. There are a lot of pretty awesome NoSQL solutions on the market, but from my point of view the most promising is Cassandra.

Cassandra was originally developed at Facebook (by the way, the core developer was hired from Amazon – he was one of the authors of Amazon Dynamo). In 2008 Facebook open-sourced the project, and it is now developed at Apache.

The Apache Cassandra project is based on two awesome papers, "Bigtable: A Distributed Storage System for Structured Data" (2006) and "Dynamo: Amazon's Highly Available Key-Value Store" (2007). The result is:

Fault tolerant
Data is replicated and nodes can be replaced with no downtime.
Scalable
Read and write throughput increase linearly as new nodes are added.
Proven
Digg, Facebook, Twitter and more are great examples of real-world usage.
Easy to use
High-level APIs for Java, Ruby, Python, Scala and more.

Most of the API is exposed through Thrift, which is also developed by the Apache Foundation (at the time still in the Incubator). It is a framework for cross-language service development.

As Cassandra is a NoSQL store, it does not have typical relational tables; instead it uses data structures such as the following (sketched in code after this list):

Column
A tuple of (name, value, timestamp).
SuperColumn
A name plus a map of Columns.
ColumnFamily
An infinite container for Columns.
SuperColumnFamily
An infinite container for SuperColumns.
Keyspace
The outermost grouping of your data.
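To make these terms concrete, here is the hierarchy sketched as nested Python dicts; all names, values and timestamps are made up for illustration.

```python
# The pre-CQL hierarchy sketched as nested Python dicts (illustrative only).
twissandra = {                      # Keyspace: the outermost grouping
    "Users": {                      # ColumnFamily
        "pedro": {                  # row key
            # Column = (value, timestamp), keyed by column name
            "password": ("s3cret", 1286924364),
            "email": ("pedro@example.com", 1286924364),
        },
    },
    "Following": {                  # SuperColumnFamily
        "pedro": {                  # row key
            "following": {          # SuperColumn: a name -> map of Columns
                "alice": ("", 1286924365),
                "bob": ("", 1286924366),
            },
        },
    },
}
```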

There is a very nice introductory article on the Cassandra data model.

The main problem with NoSQL databases is that data modeling is completely different from relational data modeling. Because we ask for structured data by a given key, we should store the data in a way that matches those lookups to get the best performance.

First of all, NoSQL databases are not always better, so as always: use the right tool to get the job done without pain. We should decide whether we need a NoSQL database such as Cassandra at all. So… here is why we might want a NoSQL solution:

  1. No single point of failure – the relational model is hard and expensive to cluster (sequences, cascades, transactions, etc.); Oracle or MySQL focus on consistency, as opposed to Cassandra, which favours availability and partition tolerance (see the CAP theorem).
  2. Relational theory is about normalization (1NF, 2NF, 3NF and beyond). NoSQL differs here: since we want to fetch all the needed data with a single query, we allow data to be duplicated. We do not structure data to be normalized; we structure it for the queries that will actually be executed (see the sketch after this list).
  3. With a flexible-schema store like Cassandra we can add and remove fields on the fly, which is a huge plus as a deployment grows (hundreds of nodes).
  4. Most setups of "normal" databases are master-based (usually a single "master" node where all writes go); with Cassandra, writes are distributed, so we can write data anywhere.
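To make point 2 concrete, here is a hedged sketch of query-first modeling in today's CQL, twissandra-style (table and column names are made up): the same tweet is written once per query pattern, so every read becomes a single partition lookup. Obtaining a connected `session` is shown in the driver sketch further down.

```python
# Hypothetical twissandra-style schema: one table per query, data duplicated.
#
#   CREATE TABLE tweets_by_user (
#       username text, tweet_id timeuuid, body text,
#       PRIMARY KEY (username, tweet_id));
#
#   CREATE TABLE timeline (
#       follower text, tweet_id timeuuid, author text, body text,
#       PRIMARY KEY (follower, tweet_id));

def post_tweet(session, author, followers, tweet_id, body):
    """Denormalize on write: one insert per table that will be queried."""
    # Query 1: "show me everything this user tweeted".
    session.execute(
        "INSERT INTO tweets_by_user (username, tweet_id, body) VALUES (%s, %s, %s)",
        (author, tweet_id, body))
    # Query 2: "show me my timeline" - the tweet is copied to every follower.
    for follower in followers:
        session.execute(
            "INSERT INTO timeline (follower, tweet_id, author, body) "
            "VALUES (%s, %s, %s, %s)",
            (follower, tweet_id, author, body))
```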

There are a lot of nice articles about installing Cassandra, so I will just point to them here:

As an OS X user I used the last two, but the highlighted link is worth a look, as it shows how to build a Cassandra cluster.

To play a little with Cassandra we will use the Cascal library, which is hosted on GitHub. Cascal has pretty good documentation, so if something is unclear, refer to the Cascal wiki. One additional important project is twissandra, an example project demonstrating how to use Cassandra; to better understand the data model, it is worth getting that project and playing with it a little.

The practical part is outdated; Cascal is outdated. A current list of drivers is on the Cassandra Planet page.
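For the impatient, a minimal sketch with the current DataStax Python driver (cassandra-driver), assuming a node on localhost and the hypothetical schema from the modeling sketch above:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])         # contact points
session = cluster.connect("twissandra")  # hypothetical keyspace name

rows = session.execute(
    "SELECT tweet_id, body FROM tweets_by_user WHERE username = %s",
    ("pedro",))
for row in rows:
    print(row.tweet_id, row.body)

cluster.shutdown()
```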

Summary

Cassandra is well known for having no single point of failure, and it is the data storage behind Facebook, Twitter and Digg. Most importantly, Cassandra now has commercial support: if your business doesn't have time to learn and play with Cassandra, you can call Riptano, who provide services and training for Apache Cassandra.

Yes, Riptano is now called DataStax 🙂

Cassandra is having her five minutes of fame, and as we see, she proves it is worth putting in some effort to learn NoSQL-style data models. As always, some problems are ideal for relational storage and some are typical for Cassandra; it is good to have both tools in our toolbox.

While I was playing with Cassandra, the team released versions 0.6 and 0.6.1. The most important new feature is Hadoop MapReduce support; there are also performance improvements from a new caching layer. As you can see, they are moving fast :).

Yes, they did move really fast. I played with 0.5.1 and now we have 2.1. I decided to publish this outdated post, but to make it interesting I've added the Cassandra Time Machine.

References:

Cassandra Time Machine:

This time machine shows the most important changes in each major version of Cassandra. Besides these changes there were a lot of bug fixes and improvements as well. I built it to appreciate the work of Cassandra's contributors – they did, and still do, a great job!

0.6.x (2010)

The Cassandra team resolved 348 issues (some of them ported from 0.7.x) across thirteen releases. From version 0.6.7 on, all releases were bug fixes back-ported from 0.7.x.

Features added:

  • Simple and very "stupid :)" Hadoop integration,
  • Dynamic endpoint snitch – an endpoint snitch that automatically and dynamically infers "distance" to other machines, without having to explicitly configure rack and datacenter positions,
  • JMX is accessible to non-Java clients,
  • Authorization and authentication (the beginnings),
  • Per-keyspace replication factor (the beginning of replication strategies),
  • Row-level cache,
  • InProcessCassandraServer for testing purposes; it has since been replaced by EmbeddedCassandraService,
  • and many more minor features (ConsistencyLevel.ANY, ClusterProbe, pretty-printed column names, more JMX operations, the global_snapshot and clear_global_snapshot commands, the cleanup utility…)

0.7.x (2011)

This time the team resolved 1006 issues across ten releases.

Features added:

  • Expiration time (TTL) for columns; an expired column acts as 'markedForDelete'.
  • Configurable 'merge factor' for column families. The MergeFactor attribute tunes read vs. write performance for a ColumnFamily: a lower MergeFactor causes more frequent compaction, improving read performance at the cost of write performance.
  • Creating indexes on existing data.
  • EC2Snitch – this snitch treats an EC2 region as a DC and an availability zone as a rack.
  • scrub command – rebuilds SSTables for one or more column families.
  • A removal operation that works on key ranges and can delete an entire column family (the truncate operation).
  • Weighted request scheduler.
  • and many more (access levels for Thrift, many cassandra-cli improvements, the NumericType column comparator, support for Hadoop Streaming, cfhistograms, secondary indexes for column families, a per-node JMX interface, …)

0.8.x (2011)

The last version before 1.0. The team resolved 549 issues and released ten versions.

Features added:

  • CQL (Cassandra Query Language) 1.0 language specification.
  • The idea of coprocessors (from a hackathon), later renamed to plugins and finally implemented in 2.x as triggers.
  • SeedProvider is pluggable via an interface.
  • Encryption support for internode communication (all, none).
  • EC2 features for setting seeds and tokens (in EC2, machines die and come back up more frequently).
  • Compaction throttling.
  • Support for batch insert/delete in CQL.
  • JDBC driver for CQL.
  • and many more (more cli commands, different timeouts for different classes of operations, counter column support for SSTableExport, EndpointSnitchInfoMBean…)

1.0.x (2011)

The first stable release. 510 issues were resolved across twelve versions.

Features added:

  • SSTable compression – a long-awaited feature (CASSANDRA-47). Most of the time it is a good trade of CPU for I/O.
  • Stream compression – today we have Snappy, LZ4 and Deflate compression.
  • Checksums for compressed data, to detect corrupted columns.
  • Better performance for rows that contain many (more than a thousand) columns.
  • The max client timestamp for an SSTable is captured and exposed via SSTableMetadata.
  • Encryption for inter-DC traffic only.
  • Timing information for cassandra-cli queries – it looks cosmetic, but is very handy.
  • Redesigned compaction.
  • CQL 1.1.
  • and many more (RoundRobinScheduler metrics, overriding RING_DELAY, the upgradesstables nodetool command, CQL improvements, Bloom filter stats and memory size, …)

1.1.x (2012)

The 1.1 line had twelve releases, and the team resolved 620 issues.

Features added:

  • Concurrent schema migrations.
  • Prepared statements (see the sketch after this list).
  • Infinite bootstrap – for testing a new configuration with live traffic; in this mode a node follows the normal bootstrap procedure but never fully joins the ring.
  • Running MapReduce jobs with server-side filtering.
  • Overriding the available-processors value, so multiple instances can be deployed on a single machine.
  • The CompositeType comparator is now extendable.
  • Fine-grained control over data directories, so we can control which SSTables are placed where.
  • Eagerly re-writing data at read time.
  • Configurable transport in RecordReader and RecordWriter.
  • and many more (ALTER of column family attributes in CQL, the Gossip goodbye command, loading from flat files, the COPY TO command, CQL support for describing keyspaces and column families, the "rebuild index" JMX command, a disable-snapshots option, the "ALTER KEYSPACE" statement…)
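Prepared statements survived all the way into today's drivers, so here is a hedged sketch of the idea with the modern Python driver (the keyspace and table are the hypothetical ones from earlier): the server parses the statement once, and afterwards only bound values travel over the wire.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("twissandra")  # hypothetical keyspace

# Parse once...
insert = session.prepare(
    "INSERT INTO tweets_by_user (username, tweet_id, body) VALUES (?, now(), ?)")
# ...bind and execute many times.
for body in ("hello", "world"):
    session.execute(insert, ("pedro", body))
```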

1.2.x (2013)

This time the Cassandra team resolved 997 issues and released nineteen versions.

Features added:

  • Client-provided timestamps are disallowed, so WITH TIMESTAMP was ripped out.
  • Query tracing details – a very helpful feature.
  • The ability to query collection types (list, set and map) in CQL.
  • CQL 3.0 (better support for wide rows, a generalization of composite columns, per-column-family default consistency level).
  • The Murmur3 partitioner, which is faster than MD5.
  • Different timeouts for reads and writes.
  • Atomic, eventually-consistent batches.
  • Compressed and off-heap Bloom filters.
  • Global prepared statements instead of connection-scoped ones.
  • Describe cluster for nodetool and cqlsh.
  • Metrics for the native protocol and global ColumnFamily metrics.
  • Latency/consistency analysis within nodetool, so users can accurately predict Cassandra's behavior in their production environments without interfering with performance.
  • Custom CQL protocol and transport.
  • LZ4Compressor – compression about two times faster than Snappy.
  • The LOCAL_ONE consistency level.
  • and many more (improved authentication logs, multiple independent leveled compactions, UpgradeSSTables optimization, tombstone statistics in cfstats, ReverseCompaction, resizing of threadpools via JMX, an option to disable the SlabAllocator, notification before deleting an SSTable…)

2.0.x (2013)

2.0 was released, and here you can find a great DataStax article: What's under the hood in Cassandra 2.0. There were ten versions, and the team resolved 868 issues.

Features added:

  • Triggers – asynchronous triggers are a basic mechanism for executing application code asynchronously on the database side.
  • A query paging mechanism in the native CQL protocol.
  • Compare-and-set support (writes with IF conditions) – see the sketch after this list.
  • Streaming 2.0.
  • Multiple gossip ports on a single IP address; this allows multiple Cassandra service instances on a single machine, or a group of machines behind a NAT.
  • CQL improvements.
  • Reduced Bloom filter garbage allocation.
  • A network topology snitch supporting preferred addresses, which makes it possible to span a cluster across multiple data centers, some in Amazon EC2 and some not.
  • and many more (index_interval configurable per column family, single-pass compaction, tracking SSTable coldness, the beforeChange notification, CqlRecordReader, a balance utility for vnodes, triggers for LWT operations…)
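Here is a hedged sketch of compare-and-set with the modern Python driver; the users table is hypothetical. The insert applies only if the row does not already exist, and the first column of the returned row is the [applied] flag.

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("twissandra")  # hypothetical keyspace

rows = session.execute(
    "INSERT INTO users (username, password) VALUES (%s, %s) IF NOT EXISTS",
    ("pedro", "s3cret"))
print("applied:", rows[0][0])  # False if 'pedro' already existed
```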

2.1.x (2014, during the Cassandra Summit)

Two betas, six release candidates, 535 issues resolved. That's great news. DataStax provided great articles about 2.1:

Final Word

Rafał provides a great Cassandra Modeling Kata. It's worth reading!

This post is originally from April 2010. I've added quoted comments, and the Time Machine section (after the original references). Currently the best places to start are the DataStax blog, CassandraPlanet and, of course, Twitter.

Some features were back-ported to earlier versions, which is why they appear under those versions (e.g. CASSANDRA-5935 was fixed in 2.0.1 but also ported to 1.2.10, so it shows up under 1.2.x).

In the meantime, DataStax has become a great company. They have been behind Cassandra for many years, providing stable and continuous growth. Its current valuation is around 830 million US dollars. If you have spare money, I recommend investing in this company :). The last pre-IPO round raised $106 million for DataStax.