Most products start with just a handful of simple use-cases. The system starts with a simple architecture: a bunch of tables and a framework built upon a standard design pattern like MVC.
We chose Ruby on Rails as the framework, and built our system around Rails conventions such as:
- Thin controller, thick model. Business logic resides in models; the controller only acts as an external entry point.
- Rails-styled HTML forms (using form_for and other helpers), which are easily consumed by model methods (save, update_attribute).
- The controller chooses the view to present depending upon context (logged-in user’s role, etc.); the resource being accessed/transacted upon (the model) remains agnostic of the end user.
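The first two conventions can be sketched in plain Ruby (Rails-free; class and method names here are illustrative, not from our codebase): validation and state transitions live on the model, while the controller only relays input.

```ruby
class Trip
  attr_reader :destination, :status

  def initialize(destination)
    @destination = destination
    @status = :draft
  end

  # "Thick model": the business rule lives here
  def publish!
    raise ArgumentError, 'destination required' if destination.to_s.empty?
    @status = :published
    self
  end
end

class TripsController
  # "Thin controller": no business logic, just an entry point
  # that delegates to the model
  def create(params)
    Trip.new(params[:destination]).publish!
  end
end
```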
The problem
Fast forward a few years. We had launched many business verticals; many features were added, deprecated, iterated and experimented upon. We started to observe some shortcomings of the standard approach:
- Since a model contains all business logic, and there can be numerous entry points to it, a lot of logic and housekeeping code had to be written in the model itself – often in callbacks, since callbacks are automatically fired whenever a change happens or is about to happen (create/update/delete).
- Callbacks are part of the data object’s life cycle. However, since business logic was written there as well, the two started to intermix. Business logic code became aware of the model object’s state (dirty, new, etc.), transaction scope, and so on. Consider an example where payment installments are changed when a few installments are already paid. Here, the create/update/delete callbacks of each `payment` object would need to know about the other (dirty) payment objects to perform validations, adjustments, etc. Since that isn’t feasible in a sane way, the less insane way was to move all the logic to the parent model and somehow group all triggers into a single update.
- Model relationships became complex, and the complexity of callbacks increased even more. Consider a simple example where only a partial payment is made against an installment. A new installment needs to be created from the balance amount, and the status of the parent quotation might change to ‘partially paid’. In terms of code, the after/before update callback of the installment would need to create a new sibling installment and, as in the previous example, update its parent `quote`, which would fire subsequent callbacks. And, depending upon your schema, the relationship might not be as simple as `Quote has_many :payments`
- Models representing bookings, quotations, etc. became larger and contained increasingly complex validations and processing. We were still utilising the transaction scope of the `save` or `update_attributes` method(s) to maintain data consistency. However, on the simplest code/data failure the entire transaction would roll back and the user would have to start all over again. Due to various product/code complexities, refilling the form with previously entered data became increasingly complex and undesirable.
- We had been using `exception_notifier` to send us a mail in case of server errors, which has a critical shortcoming: exceptions in `after_commit` callbacks die silently. Secondly, the email system wasn’t scalable anyway (the mailbox would flood during an outage) and the probability of a critical but less frequent issue being overlooked remained high. As a result, we didn’t have any measure of the stability of critical code written in callbacks.
- We started to notice limitations in managing configurations. Over time, a lot of such interfaces came into existence, each with its own UI with a different level of intuitiveness, behaviour, guidelines and nuances. This heterogeneous spaghetti of CM systems led to further problems:
- Each new interface would require a separate internal training.
- Overlapping, conflicting or dependent configurations would get spread across multiple interfaces. This often resulted in knowledge gaps and a broken system.
- Since an end user could only change “values”, changing simple conditions needed a full release cycle, consuming bandwidth of the product, development and quality teams. For example, consider a trivial case where the definition of ‘high-intent’ needs to be changed from “leads having hotel preferences filled” to “leads having hotel & sightseeing preferences filled”:
- A product manager would understand the requirement, and prioritize it in one of the upcoming sprints.
- A developer would dig into the code, find the place(s) where the logic has been put, change it and test it.
- A QA engineer would list down all entry points (lead creation, preference change, etc.), test the change at all the entry points with an assortment of input combinations, and finally amend the automation cases to include the change.
From start-to-end, the entire process would take 2-4 weeks!
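To illustrate the difference (hypothetical code, not our actual implementation): the ‘high-intent’ definition hard-coded in Ruby needs a full release cycle to change, while the same definition expressed as data can be evaluated by a rule engine and edited at runtime.

```ruby
# Hard-coded: changing this predicate requires a deployment
def high_intent?(lead)
  !lead.dig(:preferences, :hotels).nil?
end

# Declarative: the same rule as data, editable without touching code
HIGH_INTENT_RULE = {
  operator: 'AND',
  conditions: [
    { field: 'preferences.hotels',      operator: 'EXISTS' },
    { field: 'preferences.sightseeing', operator: 'EXISTS' }
  ]
}.freeze

# A toy evaluator supporting only the EXISTS operator
def satisfies?(rule, lead)
  rule[:conditions].all? do |c|
    !lead.dig(*c[:field].split('.').map(&:to_sym)).nil?
  end
end
```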
While the issues listed earlier were due to hastily written code or simply how the system evolved over time, configuration management had to be seen in a wider context: the ability to dynamically configure workflows, not just some values/conditions. Realtime health metrics, a standardised interface for analytics, etc. needed to be baked in as well!
The solution
We knew that the system needed to move in a direction where it is more declarative, flexible and decoupled. Hence, with the objective of solving the current problems in a wider context, we decided to move towards an event-driven architecture, which can be visualized as:
The flow control is simple:
- The user interacts with the system (e.g. puts up a request for packages, makes changes to package requirements, etc.)
- Changes are persisted after basic validation. The system figures out whether the business state of the system (e.g. a trip request needing to be forwarded to partner agents) has been altered.
- If so, the system initiates a corresponding event with all the necessary context, e.g. a “trip_created” event with trip_id, user_ip, etc.:
{
  event_name: :trip_created_event,
  context: {
    type: Trip,
    id: 1234567
  },
  data: {
    user_id: 785634,
    email: "awesome@eda.com",
    created_from_ip: "172.0.0.1",
    preferences: { … }
  }
}
- The event router acknowledges the event, retrieves its handlers and invokes them in sequential, parallel or a hybrid order. A few handlers might be configured statically (hard-coded):
handlers: {
  trip_created_event: [
    "Eda::Handlers::Trip::PostCreateEmail",
    "Eda::Handlers::Instrumentation::Trip::Create"
  ]
}
- Other handlers are determined by invoking the rule engine. Since we already had a DSL-based rule engine, we went a step further with some modifications so that it can configure handlers based on rules!
To explain the above workflow, let’s walk through an example: “Send a Diwali discount email after 30 minutes if the traveler fills the necessary preferences (hotels, budgets, etc.) for a trip created during October 15-20, 2017”. This can be realized with the following configuration and control flow:
Rule configuration
An admin within TT would configure the notification_handler with a trip_created_event ruleset via the rule engine interface, with the following conditions:
- The trip has been created between October 15-20, 2017
- Hotel preference has been specified
If this rule evaluates as ‘success’, the rule engine returns notification_service as a value of type ‘handler’. Clicking on Options allows setting values for the handler:
- Template to use for sending email. We already had a template system for editing email content directly from a CMS; only template name needs to be configured here.
- Sending delay, ’30 min’ in this case.
In technical terms, this is how the configuration is represented via rule DSL:
{
  "condition_set": [
    {
      "operator": "AND",
      "conditions": [
        {
          "field": { "key": "requested_trip.created_at" },
          "operator": "BETWEEN",
          "value": { "key": ["2017-10-15", "2017-10-20"] }
        },
        {
          "field": { "key": "preferences.hotels" },
          "operator": "EXISTS"
        }
      ]
    }
  ],
  "success_action_set": {
    "Handler": [
      {
        "name": "notification_service",
        "metadata": {
          "template_name": { "type": "String", "value": "diwali_mailer" },
          "sending_delay": { "type": "Integer", "value": "30" }
        }
      }
    ]
  }
}
Event: Trip creation
A traveler then creates a trip. In the Rails context, an object of the `RequestedTrip` model gets created and saved. Rails provides the ability to configure ‘hooks’ into the object lifecycle via callbacks. We added the eda (event-driven architecture) system entry point as an after_commit callback, using some meta-programming elegance!
class RequestedTrip < ActiveRecord::Base
emit_events_on :create, :update
end
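A minimal sketch of how such an `emit_events_on` macro could be built (hypothetical and Rails-free; the real implementation would additionally wire the registered actions into `after_commit` callbacks):

```ruby
module Eda
  module Emitter
    def self.included(base)
      base.extend(ClassMethods)
    end

    module ClassMethods
      attr_reader :emitted_actions

      # Registers lifecycle actions that should emit events.
      # In Rails this would also wire the callbacks, e.g.:
      #   actions.each { |a| after_commit(on: a) { dispatch_eda_event(a) } }
      def emit_events_on(*actions)
        @emitted_actions = actions
      end
    end

    # Derives the event name from the model class and action,
    # e.g. RequestedTrip + :create => :requested_trip_created_event
    def eda_event_name(action)
      base = self.class.name.gsub(/([a-z])([A-Z])/, '\1_\2').downcase
      :"#{base}_#{action}d_event"
    end
  end
end

class RequestedTrip
  include Eda::Emitter
  emit_events_on :create, :update
end
```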
After some processing, the hook triggers an event:
def trigger
  Eda::Client.log :info, { event_name: event_name }, 'About to trigger event'
  router_class_name.dispatch(event_name, payload)
end
The event router receives the event, and retrieves both types of handlers:
static_handlers = event_registry.handlers(event.event_name)
rule_handlers = RuleSystemClient.new(event.data, event.event_name).fetch
Executor.new(event, static_handlers + rule_handlers).execute
The executor class executes all the handlers (in sequence, as of now):
def execute
  handlers.each do |handler|
    # Record a success/failure metric for every handler invocation
    m = {
      measurement: MEASUREMENT,
      tags: lambda { |res| { event: event.event_name, handler: handler.handler_name, success: res } },
      values: {}
    }
    Eda::Client.push_metric_with_benchmark(m) do
      begin
        handler.execute(event)
        true
      rescue => e
        Eda::Client.handle_exception(e)
        false
      end
    end
  end
end
The executor neatly records handler success/failure metrics in our metrics and monitoring system, giving us a clear view of system stability & performance. The visualization speaks for itself:
PS: If you want to understand our metrics & monitoring system, which generates stunning visualisations like this, check out part I & part II.
Impact
The immediate impact we noticed was the low turnaround time required for simple changes. One of the first use-cases of this architecture was configuring an automatic phone call to customers who had created a trip. An IVR handler needed to be written to trigger a third-party API for initiating the phone call, taking two parameters: a call workflow id (as “destination missing” would have separate call content from “hotel preferences missing”) and a trigger delay (in case we don’t want to trigger a call immediately).
Now, to configure the IVR handler, we simply go to the rule engine and configure a condition (i.e. destination EXISTS) and, as a return value, configure IVRHandler with workflow_id and delay as parameters! One can tweak any of these parameters (i.e. create a new workflow for new conditions, tweak the trigger delay, etc.), or even remove the IVR handler, without needing any development cycles!
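In DSL terms, such an IVR configuration could look like this fragment (handler and key names here are illustrative, mirroring the notification example earlier):

```json
"success_action_set": {
  "Handler": [
    {
      "name": "ivr_handler",
      "metadata": {
        "workflow_id":   { "type": "String",  "value": "hotel_preferences_missing" },
        "trigger_delay": { "type": "Integer", "value": "0" }
      }
    }
  ]
}
```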
Learnings & pitfalls
The power of this architecture lies in the proper design of events and handlers. Not all changes have a ‘substantial’ impact on the system, and hence only a handful of changes should be modelled as ‘events’. Not adhering to this principle would eventually bloat the list of events & handlers, and it would be hard to maintain stability.
Similarly for handlers, the power lies in flexibility. However, a handler should expose only what needs to be configurable, and not every single value. In the previous example, for the notification_service handler ‘sending_delay’ is configurable; however, the recipient address isn’t (it will always be the creator of the trip).
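One way to enforce that boundary is to whitelist the configurable keys inside the handler itself. A sketch, assuming a hypothetical handler class (our real handlers live under the Eda::Handlers namespace shown earlier):

```ruby
class NotificationHandler
  # Only these keys may be set from the rule engine metadata
  EXPOSED_KEYS = %w[template_name sending_delay].freeze

  def initialize(metadata)
    # Anything outside the whitelist is silently dropped
    @config = metadata.slice(*EXPOSED_KEYS)
  end

  def execute(event)
    {
      # Recipient is deliberately NOT configurable: always the trip creator
      to: event[:data][:email],
      template: @config['template_name'],
      delay_minutes: Integer(@config.fetch('sending_delay', 0))
    }
  end
end
```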
Lastly, the rule engine DSL should not be designed as a replacement for code itself; more complex conditions should be pre-computed and simply exposed as keys. Let’s tweak the example above:
“Send diwali discount email after 30 minutes if traveler requests at least 4-star hotels and visiting minimum 2 cities for a trip created during October 15-20th 2017”
Here, `cities_count` should be pre-computed, and the rule DSL would have a condition such as `cities_count >= 2`.
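In practice this means the code that builds the rule evaluation context computes the derived values up front, so the DSL only needs flat comparisons. A sketch, assuming a hypothetical trip hash shape:

```ruby
def build_rule_context(trip)
  {
    'requested_trip.created_at' => trip[:created_at],
    # Derived keys are pre-computed in code, not expressed in the DSL
    'cities_count'              => Array(trip[:cities]).size,
    'min_hotel_stars'           => Array(trip.dig(:preferences, :hotels)).map { |h| h[:stars] }.min
  }
end
```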
The road ahead
So far, we have barely scratched the surface of the power of this architecture. For instance:
- The trigger for this system is still `ActiveRecord` callbacks. Eventually, the event router will be decoupled from the active record object lifecycle and event dispatch will be explicit, making the code more intuitive and readable.
- Event handlers need to be re-written in a more modular fashion, so that the event processing system can be wrapped as an independent service of its own, working with REST APIs and a central messaging bus. The code needs to be logically decoupled so that handlers can be executed in parallel, or can follow a dependency tree.
- On top of code changes, product changes will be needed to handle temporary inconsistency of data. For example, if invoice creation fails, the product needs to “gracefully” inform the user about it and show retrial status, etc.
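As a sketch of the explicit-dispatch direction (hypothetical classes, not current code): a service object, rather than a hidden lifecycle callback, decides when a business event has occurred.

```ruby
# A test double standing in for the real event router
class RecordingRouter
  attr_reader :events

  def initialize
    @events = []
  end

  def dispatch(event_name, payload)
    @events << [event_name, payload]
  end
end

class CreateTrip
  def initialize(router)
    @router = router
  end

  def call(attrs)
    trip = { id: attrs[:id], destination: attrs[:destination] } # persistence would happen here
    # Dispatch is explicit: no after_commit hook involved
    @router.dispatch(:trip_created_event, context: { type: 'Trip', id: trip[:id] })
    trip
  end
end
```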