ETL lessons learnt

The search team at Forward does a lot of ETL; data is our daily business.
Recently I wrote a script that:

  1. Collects some data from our Hadoop cluster
  2. Calls a 3rd party API via HTTP
  3. Polls the 3rd party API, waiting for the requests to be processed
  4. Downloads the result of the 3rd party call and stores it locally

Now I want to share what I learnt writing this script.

Drop OOD, think functional
I didn’t write a single domain object. I used just hashes: they come from Hive, they get used to call the external API, the XML that comes back from the API becomes a hash, and the data gathered from there goes straight into MySQL as CSV.
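A minimal sketch of the "just hashes" idea, not the original script. The field names (`keyword`, `clicks`) and the `to_csv_line` helper are made up for illustration; only the CSV class from the Ruby standard library is real.

```ruby
require 'csv'

# Hypothetical rows, shaped like what a Hive query might hand back.
rows = [
  { keyword: 'cheap flights', clicks: 120 },
  { keyword: 'car insurance', clicks: 75 }
]

# Pure function: one hash in, one CSV line out. No domain object involved.
def to_csv_line(row, columns)
  CSV.generate_line(columns.map { |column| row[column] })
end

columns = %i[keyword clicks]
csv = rows.map { |row| to_csv_line(row, columns) }.join
puts csv
```

The data never changes shape into a class: it stays a hash until the moment it becomes a line of text.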

I wrote the script in Ruby, but I wrote just a few functions. The application is stateless but stageful.
There’s no state, but the status of every single stage is saved in mongo.
In this way the script can fail at any point, but it always recovers to a consistent state and starts again from where it stopped.
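A rough sketch of what "stateless but stageful" can look like. To stay self-contained it saves the stage status in a JSON file rather than in mongo, and the stage names and the `run_stages` helper are assumptions, not the original code.

```ruby
require 'json'
require 'tmpdir'

# Runs the stages in order, skipping the ones already recorded as done.
# The checkpoint is a JSON file here (an assumption; the post uses mongo).
def run_stages(stages, checkpoint_path)
  done = File.exist?(checkpoint_path) ? JSON.parse(File.read(checkpoint_path)) : []
  stages.each do |name, work|
    next if done.include?(name)
    work.call # may raise: in that case nothing is recorded for this stage
    done << name
    File.write(checkpoint_path, JSON.generate(done))
  end
  done
end

log = []
attempts = 0
stages = {
  'collect'  => -> { log << 'collect' },
  'call_api' => -> { attempts += 1; raise 'API down' if attempts == 1; log << 'call_api' },
  'download' => -> { log << 'download' }
}

path = File.join(Dir.mktmpdir, 'stages.json')
begin
  run_stages(stages, path)
rescue RuntimeError
  # The first run dies in the second stage; 'collect' is already recorded.
end
run_stages(stages, path) # the second run resumes from 'call_api'
```

The functions hold no state between runs; everything needed to resume lives in the checkpoint.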
Nokogiri crashes because of a bug every now and then when parsing the API request page where I get the current status of the 3rd party processing.
If anything can go wrong, it will
I didn’t care about fixing that bug or understanding why it crashes: the script will recover and try again until the parsing is successful.
I don’t need to be fast, because the 3rd party server is rather slow in processing our requests. Speed is not a requirement.
Consistency is.
The 3rd party server also went down quite a few times, because of networking issues and load.
I just keep polling and let the script fail on timeout; it starts again 10 seconds later and tries again.
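The "fail, wait, try again" loop can be sketched in a few lines. This version retries inside the same process, which is an assumption (the original may well have been restarted from outside); `retry_forever`, `delay` and `sleeper` are illustrative names.

```ruby
# Keeps calling the block until it succeeds, waiting between attempts.
# The sleeper is injectable so the example runs without really waiting.
def retry_forever(delay: 10, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError => e
    warn "attempt #{attempts} failed: #{e.message}"
    sleeper.call(delay)
    retry
  end
end

# Simulated flaky parser: it fails twice, then works.
status = retry_forever(sleeper: ->(_seconds) {}) do |attempt|
  raise 'parser crashed' if attempt < 3
  'processed'
end
puts status
```

This only makes sense because the work is safe to repeat: retrying trades speed for consistency, which is exactly the trade described above.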

Forget about REST, the important stuff is the rest
Data is the important bit, not the way you get it. The URL provided by the 3rd party is not RESTful, so what?
What I care about is the data they give us. Being RESTful or not won’t change my life: the value is in the content, not in the transport.
RTFM
Read the fucking manual.
I actually spent more time tuning and troubleshooting MySQL (where I store the data) than writing the script. That deserves a full separate post, but the point is: read the manual, and read the manual page to the end.
Drop frameworks
I use Sequel for the DB connectivity, but only to get the connection to the DB efficiently. I actually use plain SQL, so that at least I know what’s going on. When you start having tables with 80 million rows, you’d better be careful about what’s going on.

In conclusion, the funny bit is that web applications are ETL too :-)
What I wrote is valid for writing web applications too.

The Wikipedia page says:

The typical real-life ETL cycle consists of the following execution steps:

Cycle initiation
Build reference data (think about populating drop downs, static data)
Extract (from sources) (user input)
Validate (validate user input)
Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates) (business logic)
Stage (load into staging tables, if used) (store)
Audit reports (for example, on compliance with business rules. Also, in case of failure, helps to diagnose/repair) (analytics, audit)
Publish (to target tables)
Archive
Clean up
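Read with the annotations in brackets, the cycle above maps onto an ordinary web request. A small sketch under stated assumptions: the `email` parameter, the step names and the in-memory `store` are invented for illustration, and `>>` is Ruby's built-in Proc composition (Ruby 2.6 or later).

```ruby
# Each step is a plain function from hash to hash; `>>` chains them left to right.
extract   = ->(params) { { email: params['email'].to_s.strip } }
validate  = ->(data) { data[:email].include?('@') ? data : raise(ArgumentError, 'invalid email') }
transform = ->(data) { data.merge(email: data[:email].downcase) }
store     = [] # stands in for a staging table
load_step = ->(data) { store << data; data }

pipeline = extract >> validate >> transform >> load_step

# The "source" is user input, as in a form submission.
result = pipeline.call('email' => '  Bob@Example.com ')
puts result.inspect
```

Nothing here is specific to batch jobs or to web applications; only the source and the target change.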

The way we design our software these days

Rory Gibson wrote a nice write-up of Fred’s open space talk, which took place last Monday at the 10th XPDay in London.

Jamie wrote a comment on that post that deserves a reply.

He says:

You won’t know it works until later… What debt are they now in? What happens when complexity grows? How do they resolve conflict? Many start ups work because debt is hidden. If peer pressure is their number 1 hr strategy they will stagnate and not be able to respond to changes in their business.

The comment was mainly about our philosophy; another commenter on the blog post describes it as JFDI.

During my talk at the Agile Day I got a comment that I forgot to mention in my post-talk write-up.
One guy in the audience said: well, you guys just design your software with the Unix philosophy.

Oh man, this is so true.
It’s even on Wikipedia.

I particularly like the Mike Gancarz checklist of the Unix philosophy:

  • Small is beautiful.
  • Make each program do one thing well.
  • Build a prototype as soon as possible.
  • Choose portability over efficiency.
  • Store data in flat text files.
  • Use software leverage to your advantage.
  • Use shell scripts to increase leverage and portability.
  • Avoid captive user interfaces.
  • Make every program a filter.

Then let me paste here also the golden rules from Eric Raymond:

  • Rule of Modularity: Write simple parts connected by clean interfaces.
  • Rule of Clarity: Clarity is better than cleverness.
  • Rule of Composition: Design programs to be connected to other programs.
  • Rule of Separation: Separate policy from mechanism; separate interfaces from engines.
  • Rule of Simplicity: Design for simplicity; add complexity only where you must.
  • Rule of Parsimony: Write a big program only when it is clear by demonstration that nothing else will do.
  • Rule of Transparency: Design for visibility to make inspection and debugging easier.
  • Rule of Robustness: Robustness is the child of transparency and simplicity.
  • Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
  • Rule of Least Surprise: In interface design, always do the least surprising thing.
  • Rule of Silence: When a program has nothing surprising to say, it should say nothing.
  • Rule of Repair: When you must fail, fail noisily and as soon as possible.
  • Rule of Economy: Programmer time is expensive; conserve it in preference to machine time.
  • Rule of Generation: Avoid hand-hacking; write programs to write programs when you can.
  • Rule of Optimization: Prototype before polishing. Get it working before you optimize it.
  • Rule of Diversity: Distrust all claims for “one true way”.
  • Rule of Extensibility: Design for the future, because it will be here sooner than you think.

Now, believe it or not, you can throw away lots of books just by following these rules.
Unix is, IMO, the most stable software humanity has ever written (think about Unix, Linux, macOS and so on).

This is the way we work.
We do have technical debt and we have plans to rewrite our small applications. We do that continuously: since they are small, modular and have a single responsibility (think about what cat, ack, grep, etc. do), it’s never a big deal.
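To make "small, single responsibility, every program a filter" concrete, here is a tiny grep-like filter. It is only an illustration; `filter_lines` is a made-up name and none of our applications is quoted here.

```ruby
require 'stringio'

# A filter in the Unix sense: lines in, lines out, one job only.
# It keeps the lines matching a pattern, like a tiny grep.
def filter_lines(input, output, pattern)
  input.each_line do |line|
    output.puts(line) if line.match?(pattern)
  end
end

# As a command line tool it would be wired to the standard streams:
#   filter_lines($stdin, $stdout, /error/)
out = StringIO.new
filter_lines(StringIO.new("ok\nerror: disk full\nok\n"), out, /error/)
puts out.string
```

Something this small is cheap to throw away and rewrite, which is the whole point.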

The way we write software these days

In my previous blog post I wrote about most of the conversations and feedback I got after the Italian Agile Day.
In this one I want to address the software design.

Rewrite preferred over Refactoring

As I said during the speech, we don’t do that much refactoring anymore: we would rather throw away the code and rewrite it from scratch.
Paolo Polce said more or less something like this:

“You guys are good and write good enough code already, that’s why you need less (or none) refactoring”

It’s probably true, and he gave me a good explanation: on a scale from zero to ten we probably write code that is already a six or a seven, so there’s little need for refactoring.
I have to say that we do refactor a bit, especially when a new feature comes in or when we re-open the code base after a week or more.
The big difference is that if we don’t like it at all anymore, if there are some code smells, if the code is resistant to change, we just rewrite it.
Also, one type of refactoring we do often is reducing the codebase size.
I truly believe that the main goal of refactoring should be to keep the codebase small.
I had a chat with Mike the other day and I found that he’s doing the same thing in his team: he tries to keep a part of his code base under 200 lines, even when they are adding new features.
I used to be obsessed with writing small classes and small methods. Let me tell you, that’s just nothing compared to writing small modular applications.
Also, small classes and short methods too often imply huge codebases: it’s hard to understand the intent of a system when its intent is scattered through hundreds of files.

Bounded Contexts

We do use bounded contexts, and this helps to keep the applications simple, easy to change and easy to rewrite when needed.
It also allows us to use the right tool for the job (picking node.js, Clojure or plain Ruby depending on the task).

Aggregation

Talking with ziobrando after the conference I realized that in most of our projects we implement aggregates, as DDD defines them:

Cluster the Entities and Value Objects into Aggregates and define boundaries around each. Choose one Entity to be the root of each Aggregate, and control all access to the objects inside the boundary through the root. Allow external objects to hold references to root only. Transient references to the internal members can be passed out for use within a single operation only.

Using NoSQL helps a lot, but we implement aggregates with MySQL too.
The key is to throw away all the frameworks and the patterns that dominated the market over the last five or six years: we don’t use Sequel models, we don’t use (Rails) ActiveRecord. Most of the time we don’t even write domain objects, we just use hashes.
In functional programming this is definitely easier to achieve; however, in Ruby you can obtain pretty good results as well.
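A hedged sketch of an aggregate built from plain hashes. The order example, its fields and the `new_order` and `add_line` functions are all hypothetical; the only point is that every change goes through the root.

```ruby
# The aggregate is a nested hash; the order is the root and the only entry point.
def new_order(id)
  { id: id, lines: [], total: 0 }
end

# Every change to the internal lines goes through a function that takes the root.
# It returns a new root instead of mutating the old one.
def add_line(order, sku, price, quantity)
  lines = order[:lines] + [{ sku: sku, price: price, quantity: quantity }]
  order.merge(lines: lines, total: lines.sum { |line| line[:price] * line[:quantity] })
end

empty = new_order(1)
order = add_line(add_line(empty, 'A1', 10, 2), 'B2', 5, 1)
puts order[:total]
```

Because the total is recomputed whenever a line is added, the root stays consistent without a single class.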

Blue Green Deployment

The way we deploy our code to production has been well explained by Martin Fowler in this blog post.
I admit I didn’t know the name of this technique, I was just using it. Thanks again to ziobrando for telling me the name!
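For readers who have not met the technique: two identical environments, one live and one idle, and a release is a switch from one to the other. The sketch below is a toy in-memory model under that assumption; the `release` and `idle` functions and the version strings are invented, and it says nothing about our actual deployment scripts.

```ruby
# Two identical environments; :live names the one receiving traffic.
def idle(state)
  state[:live] == :blue ? :green : :blue
end

# Deploys to the idle environment, checks it, and only then flips the router.
def release(state, version, healthy:)
  target = idle(state)
  deployed = state.merge(target => version)
  return deployed unless healthy.call(version) # failed check: live is untouched
  deployed.merge(live: target)
end

state = { blue: 'v1', green: nil, live: :blue }
state = release(state, 'v2', healthy: ->(_version) { true })
puts state.inspect

# Rolling back is just pointing the router at the other environment again.
rolled_back = state.merge(live: idle(state))
```

The old version stays deployed and untouched, so a bad release costs one switch back.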

DSLs preferred over Patterns
As I said in the presentation, I can’t remember the last time in the last six months that I introduced or used a pattern in my code.
We use tiny frameworks and DSLs rather than patterns. I think that this is the way to go.
Sinatra is a brilliant DSL for writing web applications, Haml is a lovely DSL for writing HTML, Sass is a brilliant DSL for writing CSS, then Capistrano, Rake… It’s all about DSLs.
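To show why a DSL can replace a pattern, here is a toy routing DSL in the spirit of Sinatra's `get '/path' do ... end`. It is not how Sinatra is implemented; `TinyApp` is a made-up class that only illustrates the idea with `instance_eval`.

```ruby
# A toy routing DSL: the block passed to `new` is evaluated inside the app,
# so `get` reads like a keyword of a small language.
class TinyApp
  def initialize(&definition)
    @routes = {}
    instance_eval(&definition)
  end

  # Registers a handler for a path.
  def get(path, &handler)
    @routes[path] = handler
  end

  # Dispatches a path to its handler, if any.
  def call(path)
    handler = @routes[path]
    handler ? handler.call : 'not found'
  end
end

app = TinyApp.new do
  get('/hello') { 'Hello world' }
end
puts app.call('/hello')
```

The reader sees the routes, not the machinery, and that is what a good DSL buys you.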

Agiler at forward and successful at the #IAD10

I am just back in London and I am full of notes, thoughts and comments on the presentation I gave at the 7th Italian Agile Day.

To start with, I want to thank all the guys at Forward who created such an environment. They are in the credits at the end of the presentation, but for this blog post I want to put them first.

Secondly, I want to say thank you to the whole audience. I never had such a brilliant audience: the room was full and I struggled to finish the presentation because there were something like ten hands up or more.
And the conversation didn’t finish there. Outside, and for the whole day, I talked with so many great developers who gave me so many insights that I am now struggling to remember all of them.
And that’s the reason for this blog post:
to give you an opportunity to better understand my presentation (especially if you weren’t there) and to write down what I learned after it.

Agile is now mainstream, with all the consequences

Nusco gave us an awesome keynote. At a certain point he quoted the original paper on the waterfall methodology, which sounded weirdly similar to agile.
I checked, and he was right. Not only that:
I saw the waterfall diagrams, and now I remember where I saw them for the first time.
Craig Larman showed them to the audience, at the Italian Java Conference. It must have been 2004 or 2005. I can’t remember.

I love the part of the paper where the author admits:

I believe in this concept, but the implementation described above is risky and invites failure.

The point is that since Agile is becoming mainstream, it’s getting polluted by certifications, labels, zealots, and people reading and learning about agile but forgetting that the implementation can be risky and lead to failure, same as waterfall.

My presentation tried to explain that you should use just the right tools for the context you are in and avoid using the full stack just because you heard of it.

As ziobrando put it on twitter: “Great insights from @javame yesterday evening. Funny to see how contexts put dogmas in perspective.”

emadb tweeted something like: “@javame is a revolutionary, we should reflect on his good session, a lot.”

I have to say that it’s pretty sad to be called a revolutionary in 2010, as I said before, we are just following the Agile Manifesto. So, what’s going on? What happened to Agile?

Why does Agile these days mean following books and papers over individuals?
Why does Agile these days mean writing tons of tests rather than delivering working software?
Why does Agile these days mean strictly following a set of practices rather than responding to change?
Why does Agile these days use external business analysts rather than direct customer collaboration?

These are the questions to reflect on. I don’t want to be a revolutionary, we just follow the manifesto, but so many teams have lost the original plot these days.

Agile is not dead. Agile now needs to prove that it can survive in a mainstream/enterprise context without getting polluted.

context is king

I got quite a few constructive comments (online and offline) on my blog post about reducing feedback loops.

So I thought about collecting all the feedback here in this post to better explain my position on the topic.

There are some contexts where the “methodology” I explained, let’s call it process 0.01%, clearly won’t work.

Now, I still don’t fully understand why, but it’s a fact (many times it is caused by politics and fear in dysfunctional organizations).

There are enterprise contexts where there’s a need for tools to keep the project on track from a quality and delivery perspective. It’s still questionable whether the tool will fix the problem.
I think that spending a month training and staffing the team will, most of the time, pay back more than installing (buying) a CI server, a tracking tool, estimating, planning, etc.

A common mistake in those enterprise contexts is that the enterprise manager will always ask for perfection. I have seen so many good websites built by startups that have little bugs and still provide an excellent service.
Foursquare went down, people are still using it.
Facebook has many bugs but it’s a success story.

My understanding is that, most of the time, it’s more important to go live with a good enough system than to have a perfect system frozen in your source code repository.

If I had to work on a nuclear power station control system, or write the software to control a surgery robot, I would probably write lots and lots of tests, or maybe use a language that supports formal methods.

In the hippie funky world of the web small mistakes are allowed. Building software for space rockets would be fun as well, I guess, but that’s another context.

The sad thing is that lots and lots of former startups, with the help of shit consultancy companies such as IBM, Capgemini and Accenture, to mention a few, became enterprise-like.

So my message is: think like a startup as much as you can, always. It’s more fun for your employees and you will be faster to market.

think different, think nosql

I just came across this post from the High Scalability blog.

I just want to say what the NoSQL movement gave me.

We just finished writing a new application which visualizes pseudo-realtime analytics information. It’s a web application; most of our analytics data is stored in Hadoop, but for this web app we decided to use MySQL.

Hive wouldn’t be appropriate given the time it takes to execute a query (and lots of map reduce indeed), so we get the data from HDFS on an hourly basis and store it in MySQL.

After a quick spike on HBase we picked MySQL because we needed group by semantics, but at the same time we started using MySQL in a different way: we don’t have any relations, we don’t do joins, we store our data as one big table.
So basically we use the best of both worlds: no relations, but SQL.

This DB grows pretty fast: the application has been fully operational and live for less than a day now, and the total DB size (two tables) is already 48GB.

Then, for once, I came up with a good idea: we split the big table into a dozen smaller tables, basically one per analytics source region. What before was a column and a where clause in our queries became just a suffix of the original table name.

I was strongly inspired by the way Hadoop manages partitions.
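A small sketch of the table-per-region idea. The table prefix (`analytics`), the columns (`hour`, `clicks`) and the region codes are assumptions made up for the example; the post does not name the real ones.

```ruby
# Before the split the query would have looked roughly like:
#   SELECT hour, SUM(clicks) AS clicks FROM analytics WHERE region = 'uk' GROUP BY hour

# Maps a region to its own table. Only known regions are accepted,
# because the name is interpolated straight into the SQL.
def table_for(region, regions)
  raise ArgumentError, "unknown region: #{region}" unless regions.include?(region)
  "analytics_#{region}"
end

# After the split the region is a table suffix, not a where clause.
def hourly_totals_sql(region, regions)
  "SELECT hour, SUM(clicks) AS clicks FROM #{table_for(region, regions)} GROUP BY hour"
end

regions = %w[uk us de]
sql = hourly_totals_sql('uk', regions)
puts sql
```

Each query now touches one small table instead of filtering a huge one, which is much the same trick a partitioned Hive table plays with directories.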

In conclusion, that’s what the NoSQL movement gave me: it made me think differently about common data problems.