Sohan's Blog

Things I'm Learning

ORM or Not?

Yes to ORM, no to hand-rolled queries. Here’s why:

  1. You’d probably end up writing your own ORM anyway.
  2. In doing so, you’d spend a lot of money writing library-level code instead of your business logic.
  3. The time-tested ones are likely to be better than yours.
  4. Beware: you’d have to figure out some complex issues yourself, like connection pooling, concurrency/locking, caching, transactions, updates, versioning, and documentation.

If you find the ORM falling short of what you need, you’re probably either presenting an information overload or using a transactional database as a reporting store.

In this article, I use ORM to mean object-to-database mapping in general, so it applies to both SQL and NoSQL databases.

Disagree? I’d like to hear your opinion in the comments.

Development Environments and Dependency Hell

It used to be pretty straightforward to get up and running with a Rails app. You’d expect something like the following:

git clone git@blah/blah.git
cd blah
bundle
rake db:migrate
rails s

If the project is using rvm and bundler, the Ruby versions and gem dependencies are all taken care of. So far, life is good.

But it starts getting complicated. For example, your project probably uses MySQL, and no matter what, I can’t remember all the C libraries that are prerequisites for the mysql2 gem to actually install successfully. If it uses Nokogiri, take another hit for all the who-knows-what libxml2* libraries that need to be there. Another quite commonly used gem is rmagick, with similar C dependencies. Every time I hit these roadblocks, I feel helpless because:

  • I have no idea what is required
  • Someone on StackOverflow has a magical solution
  • The solution works on one Unix distro but not on others

This gets even more complicated as you start adding external project dependencies. For example, your project will probably need some queueing, caching, emailing, SOA integration, etc. And even if you have very good tests, it’s likely that you’d want to manually test whether your project holds up when integrated with those other systems. It’s best to have all these third-party products on your dev box, for obvious reasons. But it’s also super hard to keep everything in sync, because:

  • You’d have to automate the installation of all the third-party products on your dev box
  • You’d have to find a mechanism to update those third-party products as they change

To tackle these problems, teams often use infrastructure automation tools such as Chef, Puppet, PowerShell, or Vagrant. In practice, I’ve yet to find a project where these tools just work in a single pass: they miss one package or another, fail halfway through, or get you pretty close, but almost never do they work on the first run. I find this to be a recurring problem, faced by almost all dev teams.

One solution I think may work is as follows:

1. Each project’s team writes its own bootstrap script, so others can just use it
2. The bootstrap script runs on a CI server, and a commit that breaks the bootstrap fails the build
3. Nobody ever runs a manual command to bootstrap a dev machine

I’ve yet to try this approach in practice. But too many times I’ve seen only one or two people writing the bootstrapping scripts for all the projects, and nobody really finding the issues until a new dev box needs bootstrapping. A CI server would be a big push toward keeping the scripts green all the time. And finally, every time a one-off command is run by hand, a little bit of the automation opportunity is missed.
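
To make that concrete, here’s the kind of bootstrap script I have in mind. This is only a sketch for a Debian/Ubuntu box; the package list and service choices are illustrative and would differ per project:

#!/usr/bin/env bash
# bootstrap.sh - one command to take a fresh dev box to a running app (illustrative)
set -e  # any broken step fails the script, and therefore the CI build

# Native libraries that gems like mysql2, nokogiri and rmagick compile against
sudo apt-get update
sudo apt-get install -y build-essential libmysqlclient-dev \
  libxml2-dev libxslt1-dev imagemagick libmagickwand-dev

# Third-party products the app integrates with (database, caching/queueing, ...)
sudo apt-get install -y mysql-server redis-server

# The app itself
bundle install
bundle exec rake db:create db:migrate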

If you’ve had success solving these problems, please share. I’d really like to learn from you and hopefully adopt some of your proven practices and principles.

Pair Programming vs. Code Reviews

Synchronized diving. Photo credit: http://womendiving.blogspot.ca/2012/08/2012-olympic-games-10m-synchronized.html

Pair programming is like synchronized swimming. You do it together, you win it together. You try to improve your chances by helping your partner, and vice versa. Even if you lose, it’s highly likely you’ve improved simply by working together toward a shared goal.

Code review can be fun and really useful too. But deep care needs to be taken to ensure it remains the same in spirit, that is, keeps that unified, shared goal. In practice, I have found that super hard to achieve consistently.

Are you pair programming? Are you doing code reviews only? Why/why not? Would you change?

Comments are welcome.

Joined Sourcefire

Hello Dear Readers:

I just wanted to inform you that I’ve joined Sourcefire this morning as a Principal Software Engineer. My primary responsibility will be helping my team with Ruby on Rails and Web technology related expertise. I’m looking forward to this new role.

ThoughtWorks was a wonderful experience. In my last couple of years with ThoughtWorks, I worked on 4 different projects, for 4 different clients and business domains, in 4 different cities across North America. In a future post I would like to write about that fantastic journey, but for now I just wanted to share the news.

Hopefully, Sourcefire will be an amazing experience as well.

Thank you again for reading my blog.

Database Design: Sorting by Concepts on Nullable Fields

In a recent project, we had this requirement to sort a list of items by a concept that is absent in the database schema, but can be derived from other fields. To make it easy to understand, let’s build an example, simple enough to isolate the topic of interest.

Let’s say we have a list of locations, stored in a database table as follows:

Locations (location_id (primary key), province_id (NULL), city_id (NOT NULL))

And say we want to present a list of locations sorted by “address”. As you can see in this example, “address” is a concept not directly present in the database, but it can be derived from province_id and city_id. So, when we say sort by “address”, the SQL query may look like the following:

Query to sort by address:
SELECT * FROM Locations ORDER BY province_id, city_id

Now, if you take a second look at the schema, you’ll see that province_id is nullable. And if you are familiar with SQL, you already know that NULL gets special treatment: it doesn’t compare equal to anything, and in an ORDER BY, all the rows with a NULL province_id get lumped together first or last depending on the database, rather than being ordered the way the “address” concept would suggest. As a result, the expected sort by “address” cannot be achieved with this simple ORDER BY clause.

There are a few workarounds. One obvious one is to change the DB schema and make province_id NOT NULL in favor of a placeholder value, for example, “Not Available”. This eliminates the problem altogether, and it should be pretty easy to do if you have control over the database schema.

But in case this is beyond your control, you’d have to hack your query and live with the uneasy feeling in your stomach. Such a poor man’s solution is to put a CASE ... WHEN ... ELSE ... END expression in the ORDER BY, as in the sketch below.
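
For example, a query that pushes the NULL-province rows to the end while keeping the rest of the ordering intact might look like this (whether NULLs should sort first or last depends on your reporting requirement):

SELECT *
FROM Locations
ORDER BY
  CASE WHEN province_id IS NULL THEN 1 ELSE 0 END,
  province_id,
  city_id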

The reason I wanted to share this post is this: if your project needs to sort by such a computed concept, here’s what you should probably do:

  1. Say “No” to it.
  2. Change your database to incorporate the concept as a column in some table.
  3. At the very least, make sure you’re not sorting on a Nullable field.
  4. Repeat 1-3 in order.

Thanks for reading. Keep crafting good software.

MvcMailer New API

With the help of @TylerMercier, and many active users of MvcMailer, we have just released the new API for MvcMailer. This is a summary post capturing the work and lessons learned in the process.

The bulk of the work has been on removing hard dependencies on checked-in dll files for 3rd-party libraries in favor of NuGet packages. For example, we use NUnit for running our tests. Instead of referencing the dlls directly, we are now using the NuGet package. This will help contributors get up and running with the source code without having to worry about the dependencies being in the right place.

Even the NuGet command line tool itself is now added as a dependency via a NuGet package!

But this cleanup is not gonna be directly visible to the users of MvcMailer. If you install MvcMailer today, you should see a few changes as follows:

  1. MvcMailer now uses T4Scaffolding 1.0.7 instead of the older version that was causing issues with ASP.NET MVC 4.
  2. The MvcMailer package is now exclusively for ASP.NET MVC 4, and a new package, MvcMailer3, is published for ASP.NET MVC 3. After looking into the options, we found this was probably the best way to release an upgrade while still remaining compatible with MVC 3.
  3. MailerBase has a sweet new API. The old API still works, but I’d highly discourage using it. This is how the old API looks:
Example code showing the old API

class WelcomeMailer : MailerBase {

  public MailMessage Welcome() {

    var mailMessage = new MailMessage() { Subject = "Welcome to the world!" };
    mailMessage.To.Add("hello@example.com");

    PopulateBody(mailMessage, "Welcome");

    return mailMessage;
  }

}

As you can see, the old API requires you to initialize your own MailMessage, set some properties on it, and then hand it over to MailerBase so it can render the view into the message body.

We figured it would be nice to reverse this workflow using lambdas. So, there’s a new Populate method that calls you back with an already instantiated MvcMailMessage object so you can set its properties as required. MvcMailMessage extends MailMessage from the core .NET library, adding properties so you can specify the ViewName, MasterName, LinkedResources, etc.

So, with this new API, the code from above will look like the following:

Example code with the new API
class WelcomeMailer : MailerBase {

  public MvcMailMessage Welcome() {

    return Populate(x => {
      x.Subject = "Welcome to the world!";
      x.ViewName = "Welcome";
      x.To.Add("hello@example.com");
    });

  }

}

As you can see, the new API provides a nice way to get rid of some of the repetitive parts of your mailer code. It is available on both MvcMailer3 and MvcMailer.

We’d like to hear your feedback on this new API. Thank you for using MvcMailer.

LoveJS Presentation at CAMUG

I paired with @TylerMercier as we did a little hands-on demo on writing testable and OO JavaScript.

Presenting with Tyler

This was mostly based on our pairing experience on the current project, but it also included things we learned from our previous web projects. I wanted to share some of the highlights of the session with my readers on this blog:

From experience, I have seen a few characteristics that make JavaScript coding real fun. Here are a couple that stand out:

Good JavaScript is Testable and Object Oriented

Really simple stuff, but it took me quite a few years to actually start writing OO JavaScript, and even longer to write automated tests that run on a CI server. As I often say, the shittiest part of most web projects is the CSS, with all its hacks and its inherent nature. And the JavaScript is often just as smelly. But if you just start writing OO JavaScript with tests, your code will drive you to a cleaner path, simply because bad code makes testing really tough. And with Jasmine, testing JavaScript is super fun. So, we suggested the following toolset:

  1. CoffeeScript
  2. Jasmine
  3. PhantomJS
  4. CI integration
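
To give a flavor of what that looks like, here’s a tiny, made-up example (not the actual demo code, which is on GitHub): a small CoffeeScript class and a Jasmine spec for it, which PhantomJS can run headlessly on the CI server.

# counter.coffee - a minimal OO module (illustrative)
class @Counter
  constructor: -> @count = 0
  increment: -> @count += 1

# counter_spec.coffee - its Jasmine spec
describe 'Counter', ->
  it 'starts at zero', ->
    expect(new Counter().count).toBe 0

  it 'counts up by one', ->
    counter = new Counter()
    counter.increment()
    expect(counter.count).toBe 1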

The materials related to this session are available online on GitHub.

It was a great turnout: a full house and an engaged audience asking really good questions. I had a lot of fun doing the hands-on. Live demos can be daunting, especially when you are writing on a low-res screen so people can actually read the projected text. But thanks to Tyler, we had done a couple of dry runs of the demo. That helped us a lot in finding the points to transition between us, as well as building some muscle memory for writing the code live in front of the audience.

But I also wanted to share a checklist of things we should have done to prepare for this talk. It’s too bad we didn’t do the following; hopefully we will next time. Here’s what I have in mind:

  • Audio/video record the session, so I can watch myself afterwards and find some obvious room for improvement.
  • Practice with a real projector so we can find the best theme for the room size. This time I felt some of the text colors weren’t readable for the back benchers.
  • Never plan for a 50-minute talk when you have a 60-minute slot. Leave at least 20 minutes of slack, and have a backup plan for an extension in case you finish early. It sucks to cut things short or rush the ending.

But in retrospect, I was super impressed with how the session ran. Hopefully, with practice it will only keep getting better.

Deploying a Java Application Using Capistrano

In our current project, we have a little Java bridge that reads JMS messages from an enterprise service bus (ESB) and pipes them to an ActiveMQ topic. It had to be written in Java, or some JVM flavor like JRuby or Scala, because the ESB vendor only allows its signed client libraries to interact with its API.

Since we were already using Capistrano to deploy our Rails/Nginx app, we wanted to leverage the same tool for deploying the Java bridge. This also makes maintenance easier, since the same conventions are used for deploy locations, log files, rotation, and so on.
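
The gist is a deploy recipe along these lines. This is only a sketch; the application name, paths, and the build/restart commands are illustrative, not our actual configuration.

# config/deploy.rb (sketch, Capistrano 2 style)
set :application, "jms-bridge"
set :repository,  "git@example.com:jms-bridge.git"
set :deploy_to,   "/apps/jms-bridge"

namespace :deploy do
  # build the jar from the freshly checked-out code instead of the usual Rails steps
  task :build, :roles => :app do
    run "cd #{release_path} && mvn -q package"
  end

  # the bridge runs as a long-lived process, restarted on each deploy
  task :restart, :roles => :app do
    run "#{deploy_to}/current/bin/bridge restart"
  end
end

after "deploy:update_code", "deploy:build"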

Design of a MongoDB Analytics Database

We built a MongoDB based realtime analytics solution and learned a few things in the process. I would like to share some of our learnings in this post.

When tasked with finding a good database for holding realtime analytics data, I searched through the wisdom of the internet to come up with a good choice. At a high level, the database needed to deliver the following:

  1. Atomic counters so that events could be counted fast. For example, count up the Total Sales metric for each sale event.
  2. Unique sets. For example, unique set of buyers who bought anything today.
  3. Simple lookups. For example, when a sale happens, it needs to look up the original price to calculate the profit on the fly.
  4. Indexing on multiple fields. For example, a product lookup may be by its ID as well as by a combination of other fields.
  5. Fast at doing 1-4, so it can keep up with the event stream in realtime.

It became pretty clear that the internet favored either Redis or MongoDB for such requirements. I did a little spike with both and settled on MongoDB. This was because MongoDB had better support for lookups and indexing. Also, the document-oriented nature of MongoDB suited the object-oriented models well, giving a more direct match between the two worlds. Redis is awesome at increments and set operations, but it would require some application-level handling of the OO-to-DB interfacing as well as the intelligent lookups. So, MongoDB was the winner for the job.
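
To make the first two requirements concrete, here’s roughly what a single sale event turns into on the MongoDB side (shown in the mongo shell; the collection, field names, and values are made up for illustration):

// one atomic update per sale event: bump the counters and grow the unique sets
// upsert creates the day/channel document the first time an event arrives
db.metrics.update(
  { event_date: '2012/08/05', channel: 'online' },
  {
    $inc:      { total_sold: 1, total_buying_price: 420, total_selling_price: 450 },
    $addToSet: { buyers: 'buyer 42', sellers: 'seller 7' }
  },
  { upsert: true }
)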

If you’ve watched the video on the MongoDB blog, you know it’s quite straightforward to design a document for holding realtime analytics data.

The key design goal is: precompute the metrics at the required aggregation levels.

This design goal is applicable to pretty much all reporting/dashboarding projects. Since it’s common to aggregate different types of data on a dashboard, without precomputing the results, it would test anyone’s patience before the report is ready. In our case, the dashboard mostly displays data for the day in realtime. So, a simplified text only view would display the following:

Sunday, August 5, 2012

* Total Sold: 2,500
* Total Profit: $1,345,427
* Profit Margin: 5.09%
* Buyers: 311
* Sellers: 126

Online

* Total Sold: 400
* Total Profit: $98,762
* Profit Margin: 7.1%
* Buyers: 114
* Sellers: 28

In Stores

* Total Sold: 2,100
* Total Profit: $1,246,665
* Profit Margin: 4.91%
* Buyers: 257
* Sellers: 110

As discussed above, for a faster (and sane!) response time, the data better be precomputed in the database. So, to hold this precomputed data, we have designed a document like the following example:

metric: {
  event_date: '2012/08/05',
  channel: 'online',
  total_sold: 2100,
  total_buying_price: 112567432,
  total_selling_price: 123875127,
  buyers: ['buyer 1', 'buyer 2'],
  sellers: ['seller 1', 'seller 2']
}

And there is a document like this for each channel, for each day. This simple document is suitable for producing the desired report. Of course, in the real project, we need more data points in the report as well as aggregation levels beyond just a day and a channel. And you guessed it right, we have a document like this example one for each required aggregation level.

This design has held up pretty well for our project. It is simple: there’s a direct match between what is shown and what is stored. The data is already grouped, so queries simply fetch the documents without needing to perform any major computation on the fly. This model also makes it trivial to add a new data point, as well as a new reporting aggregation level.

But the challenge is to make sure we can in fact keep a document like this up-to-date as events happen in realtime without falling behind. Let’s talk about it in the upcoming post.

If you haven’t had a chance to read the previous post in this series, here’s a link to Deploying to TV Screens

Deploying to TV Screens

Of late, I have been working on a project to render realtime business data with interesting visualizations, so people can feel the pulse of the business. For the last couple of months, I have been planning to write a detailed post about it. But after a few false starts, I am finally settling on smaller posts, telling a small part of the story each time.

So, have you ever worked on a web application that is primarily viewed through 55”+ 1080p TV screens?

We are showing realtime business data, aggregated from multiple data sources as events happen. The screens are going to be mounted on the wall, like the ones you’d see at airports. And it all needs to be running 24x7.

This introduces an interesting deployment challenge:

How would you reload the screen every time you re-deploy the app?

A regular web app is interactive. So, when we re-deploy the app, users typically get the latest version as soon as they reload the page or navigate through the app. However, in an airport-like setting, where information is displayed across many screens and typically no one is clicking anything, the app needs to detect updates and reload itself. This is essential: if there’s an API change on the server side, for example, the HTML/JavaScript/CSS must be in sync to be able to render it.

Airport displays. Photo credit: 5mal5.

The app itself uses JSON API calls to render the live data. Each screen is somewhat like a single-page app, using multiple AJAX calls to render different parts of the screen, showing different data. The API calls are all funneled through a single JavaScript module, which looks something like the following (a simplified sketch for brevity; the module and field names are illustrative):
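
// api.js - every screen fetches its data through this module
var Api = (function ($) {
  var serverToken = null; // remembered from the first response after a (re)load

  var checkServerToken = function (token) {
    if (serverToken === null) {
      serverToken = token;        // first call on this page load: remember it
    } else if (serverToken !== token) {
      window.location.reload();   // the server was re-deployed: reload the page
    }
  };

  return {
    get: function (url, callback) {
      $.getJSON(url, function (data) {
        checkServerToken(data.server_token); // the extra check in the success callback
        callback(data);
      });
    }
  };
})(jQuery);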

If you look closely, there’s an extra check inside the success callback. The page remembers the server token when it loads; whenever a response carries a new token, it refreshes the page. Since all API calls are funneled through this module, supporting new screens and API calls becomes a no-brainer.

Our APIs respond with a server token, which is guaranteed to:

  1. Remain the same for each server deployment, and
  2. Change whenever there’s a new deployment.

However, we still need to make sure the server token indeed has these two essential properties. With a little trick, this becomes trivial. For our app, we are using Capistrano to deploy our Ruby on Rails project. For those new to Capistrano, it creates a timestamped directory for each deploy and symlinks the latest one to current. So, the layout looks somewhat like the following (the older release timestamps are just for illustration):
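
/app/realtime/releases/20120725091812
/app/realtime/releases/20120728114501
/app/realtime/releases/20120729083021
/app/realtime/current -> /app/realtime/releases/20120729083021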

Every Ruby on Rails app also comes with a little method, Rails.root, that returns the full path to the root directory of the current deployment. So, in this example scenario, we get the following:

Rails.root #==> /app/realtime/releases/20120729083021

Since every deployment gets a new timestamp, this method gives us a unique token per deployment. That’s all we need for the API module to detect new deployments and auto-refresh. Here’s an example controller/action (again simplified for brevity; the controller, action, and metric names are made up):
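
# app/controllers/dashboard_controller.rb
require 'digest/md5'

class DashboardController < ApplicationController
  # Rails.root changes on every Capistrano release, so a digest of it works as the server token
  # (hashed only to avoid exposing the filesystem path; the raw string would work too)
  SERVER_TOKEN = Digest::MD5.hexdigest(Rails.root.to_s)

  def sales_summary
    metrics = { total_sold: 2500, total_profit: 1345427 } # stand-in for the real lookup
    render json: metrics.merge(server_token: SERVER_TOKEN)
  end
end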

I like the organic nature of this technique: it harvests what the available tools already provide. Although the examples in this post use Ruby/Rails, I am sure the same technique can be applied to other technologies with the same simplicity.

Before I conclude, I should share one limitation of this technique. Since the page reload happens in a shared API module, the reload needs to be generic, without any special knowledge about the page being reloaded. This pretty much means a page needs to be able to reconstruct itself entirely from its URL. Requiring any JavaScript state beyond the URL would call for API-specific reload handling, killing the advantage of this technique. But the good news is, it’s always a good practice to rely solely on the URL to construct a page.

Thanks a ton if you’ve followed all the way. Stay tuned for the upcoming posts, where I will tell the story of handling multiple API calls on a page, highlighting data changes, and some other interesting bits about a realtime dashboard.