Architect Newsletter
spacer
spacer

Event driven architecture onto the Azure Services Platform

Summary

Thanks to the Azure Services Platform each and every architect has an almost infinite amount of storage and compute power at his disposal without any large upfront investments. Together with these major advantages however, also come a lot of design challenges that will change the way we design software.

In this article, I will guide you through this new environment and point out some of these design challenges that the cloud presents to us. I will also propose an architectural style, and some additional guidance, that can be used to overcome many of these challenges. Furthermore I'll give you an overview of the tools offered by the Azure cloud platform that can be used to implement such a system.

So, without any further delay, let's dive into the design considerations that you have to take into account when architecting applications for this infinitely scalable world.

Design Considerations

The impact of scale

In order to achieve this illusion of infinite storage and compute power, providing massive scalability and availability, the Windows Azure platform provides an abstraction layer on top of the machines supporting it, called the Windows Azure Fabric. This Fabric abstracts away infrastructural aspects like routing and storage and, each facet of this abstraction is called an overlay. As such, an overlay is intended to give you the idea that there is only one logical machine that takes care of storage or routing, but in reality there might be thousands of machines supporting these overlays.

Each overlay ensures that all of the information is automatically replicated over multiple machines in different "units of failure". For example the data overlay, responsible for the storage facilities like table storage, queue storage and blob storage, will replicate all of the data over to multiple of its peers. These peers are chosen in a way that reduces the probability that they all go down at the same time, by picking machines from different units of failure. Failure units are comprised of machines in different racks or even different data centers to limit the odds of all machines going down at once. The very nature of these overlays will thus distribute state over a large scale, and as data needs to travel over a wire, latency can become quite high, depending on the size of the updates obviously. To cope with this, your architecture should take into consideration that the data it is reading might be stale.

Another undesirable side effect of this potentially slow replication is the fact that for each update of information, multiple machines are involved. In an on-premise environment, you are used to having a distributed transaction coordinator orchestrate the communication between different data sources, using the 2-phase commit protocol, in order to maintain a transactional scope between these sources. Even at the web service level you might have used the WS-Transaction protocol to gain a consistent agreement on the outcome of the activities executed by these services. All of these protocols rely on an agreement between all of the transactional resources involved, and the more resources that are involved the longer it will take to reach such an agreement. This makes distributed transactions on a large scale really slow and fragile. In fact they are so fragile that Microsoft decided to provide no support for distributed transactions at all in the Azure platform. The absence of distributed transactions has a tremendous impact on the kind of architecture that you can use in the cloud; indeed any architectural paradigm that relies on distributed transactions should be reconsidered. Local transactions are still supported though, so you can have some scope of serialization when dealing with cloud enabled databases such as SQL Data Services.

Global platform

Besides the impact of massive scale, there is another important aspect to the Azure Platform: it's a global platform. This means that it is being used 24/7, 7 days a week. The fact that the cloud should be always operational gives the (automated) operators no real window to perform maintenance tasks. As such they are forced to take some of the machines down during operational hours, and thus your system as well, to apply patches and update. This means that you cannot predict at what time any part of your system might become unavailable, as you can in a regular enterprise environment.

There is an optimization in place though, that ensures availability of a part of your system if you have multiple instances of the same role deployed, which you are supposed to do anyway to take advantage of the cloud platform. Multiple instances of the same role are always assigned to different failure units, and failure units are upgraded one at a time. The same goes for upgrades to your own software and data; there is no real good window to take everything down if you want to ensure 100% availability of your system. The only option you have is to perform a rolling upgrade and take parts of your system down. In fact this issue of temporal unavailability isn't only a problem in the cloud; it manifests itself in any system that has to be available all the time, no matter whether it is on-premise or in the cloud.

But what exactly is the problem with most of the commonly used architectures today? Usually they are based on direct communication patterns such as synchronous request-reply. Think of the majority of traditional client-server, n-tier with web services or even composite service oriented architectures (SOA) that you know of, and you will see that these are quite often implemented with synchronous communication patterns. Synchronous communication requires both the involved caller and callee to be up-and-running during the time frame of the communication. But as described above, parts of your system might be taken down during the operational hours.

Your architecture has to ensure the overall availability of your system through the application of asynchronous communication patterns. Just like any of the large systems architected in the past 20 years had to do, really. Now let's have a look at such an architectural style, which can be used to deal with this temporal unavailability, and how this kind of architecture can be build on top of the Azure platform.

Event Driven Architecture

The basic idea behind an event driven architecture (EDA) is inversion of communication. In contrast to the direct communication frequently used in a composite SOA, or other architectures, communications is done asynchronously through publishing events. An EDA also consists of services, just like a regular SOA is, but in order to implement a business process these services will not call each other; they are not being composed. Instead a service is publishing events, to which other services might or might not subscribe and will optionally react to these events by performing a part of the business process.

The publishing service has no idea about which other services are dependent of it, and if the events are transported via a durable medium the dependent service doesn't have to be listening when the event occurs, it will receive it a little bit later. That's what makes this architecture so resilient to the challenges that the cloud provides us, this extreme form of louse coupling: loose coupling in time. By decoupling the processing of an event from the moment that it was raised, it allows any part of the system to go down without interrupting the overall service.

But what is an event anyway? An event is a notable thing that happens inside or outside your business. It's a state change that may signify a problem, an opportunity, a threshold, or a deviation. In short it is something interesting, defined in business terms, which has happened. Note that the term event is often used interchangeably to refer to both the specification of the event, and each individual occurrence of the event. But always remember these events find their roots in the business for which the architecture is being applied.

Imagine that you're architecting a solution which has to support a global community around some really popular musician, such as a DJ, rock star or opera singer (Pick one depending of your personal preferences). The goal of the system is to build up a community of millions of fans and use that community to drive ticket sales and merchandising. In such a system you would find events like "New Tickets Available", "Event Review Submitted", "Contest Opened" etc. Events represent some real life occurrence that is of interest to anyone who subscribes both machine and human.

From the technical point of view an event consists of a header and a body, the header contains meta-data about the event such as when it occurred, a unique identification, time stamp, correlation information and so on.

The body contains the real event information; this information should be completely self-descriptive so that event consumers should not lookup information elsewhere in order to treat the event. Note that in reality many EDA architectures pass around reference data only, usually the primary key. This is a performance optimization that reduces the payload of the event, but in return gives you a high coupling with the service that provides the entity associated to the reference data. In a true EDA, as EDA is intended, the message is supposed to contain all of the information. Especially when you take into account that the Azure platform might suffer from replication latency, I suggest sticking to fully descriptive events.

So an event is something interesting that happened earlier, but who decides what is interesting? The service from which the event originates is not capable of determining what is interesting for listening services. Furthermore, each listening service might be interested in different events from the same event source. In order to solve this duality there needs to be an additional layer in between that has the capability to decide which event is of interest to which listener, a so called event processing layer. On the Azure platform, the Internet Service Bus is especially well suited to take up this responsibility.

Command query separation

Before we continue on the discussion on EDA, I would like to deviate a little bit. I would like to return back to the 80's. In that era Bertrand Meyer devised a concept called Command-Query Separation (CQS) for the Eiffel programming language. Command-Query Separation states that any method should either be a command that performs an action, or a query that returns data to the caller, but not both. More formally, methods should return a value only if they are referentially transparent and hence possess no side effects. In other words, asking a question should not change the answer. Even though CQS is a programming principle, applying this principle at the architectural level has great value and benefits, especially when combined with EDA in large scale environments.

Let's have a look at an example to illustrate this. Imagine that one part of our system has the responsibility to provide information about upcoming gigs. Such an engine primarily provides read-only services that allow community members to view the upcoming parties. On the other hand it also provides facilities to the owner for editing and submitting gigs on the calendar. Both of these services have very different requirements in terms of scale and availability. The read-only services have to be far more available and should serve many more users than the editing services as the latter are only targeted at one single person, or maybe a small department, but the former targets potentially millions of visitors. Command query separation

Therefore you can create a separate service for the queries, for which you start multiple instances dependent of the number of visitors. These services have their own data store, which contains all the necessary data in a separate table so that the data is not influenced by any editing activity. Often there is only one table per service, which contains a de-normalized copy of the data, so that no additional processing is required for reading it. The data is optimized to be read only.

As this data is a replica anyway, you can also maintain an in memory cache to reduce dependencies on the data store and further increase the availability of the service. The service should however be prepared to query the private data store if the information is no longer available in memory. On the Azure platform you can cache data using an in memory store or the Azure local storage feature, Table Storage is perfect for use as the private data store.

In order to facilitate the editing of gigs, you can create another service of which only one or two instances are running as there is neither a real scaling nor availability requirement. The commands are delivered to these services and they perform the necessary business logic such as spell checking on the announcement, branding features like resizing images, etc. These services have their own data store, probably a normalized store such as SQL Data Services or a set of tables in Table Storage. These tables are optimized for writing and storing the structure of the objects, so that it can keep track of the editing history for example.

Notice that there is a clean separation of services fulfilling queries and commands, both at the data level as the business logic level. Synchronization of the different stores is required at some point or another though, to make sure that the querying services get updated with the latest information. In this scenario a new gig announcement has to propagate to the query services when the user decides to publish the event, maybe after a review of the impresario. And that's where EDA steps into the picture again: The editing services will raise a "Gig Announced" event to which the querying services are subscribed through the event processing engine. The query services will update the private data store, and local cache, with the new information found in the event body, so that users will be able to read it.

Now that we have covered the concepts of EDA and CQS, and have looked at how they can be combined to a highly available architecture for the cloud, it is time to look at the different parts of such a system. Furthermore we are going to look at the facilities, provided by the Azure platform, which can help us to build one.

Event flow layers

An event driven architecture is composed of four logical layers, it starts with the generation of an event, then the event is transported through an event channel, gets processed, and finally causes a set of reactions to that event.

Event generator

The first logical layer is the event generator, as its name implies this can be any piece of software that generates an event. On the Azure platform both web roles and worker roles can act as an event generator. The web role exposes a web application or synchronous web service to the outside world. It generates events when either an action is taken by a human in the web application or a web request comes in at a web services' endpoint. Raising this event will probably result in a correct treatment of that request, most likely a state change.

As there is no distributed transaction support on the Azure platform, we need to rely on other techniques to make sure that our system eventually becomes consistent, just like any large on-premise system has to do today. Several techniques exist to achieve this, such as tentative operations and apology-based computing. An in depth look at these techniques is outside of the scope of this article, but the main idea is the following. When the request comes in, the service does nothing but publish an event into the event channel and then it returns to the caller an acknowledgement that the request was received and everything will be taken care of. In case something will go wrong later on in the processing of that event, a 'failure' event will be raised leading to an appropriate action to notify the caller of the problem, the apology.

In our example system we could allow community members to request tickets for upcoming parties, when such a user does we will simple reply that the request has been received and further action will be taken. Once the system picked up the incoming request, it will start processing. Probably tickets are available and a payment was received, so we can send the tickets to the requestor. In case either of the above conditions was not satisfied, we can send the user's money back. Note that I speak in the past tense quite a lot in these examples, those words are all events.

Event Channel

The second layer, the event channel, is a piece of infrastructure that transfers the event from the event generator to the event processing engine. Usually some kind of durable queuing mechanism is used where the events are stored until they can be processed by the event processing layer. On the Azure platform there exist quite a number of facilities that can be used as an event channel.

A first candidate is the NetEventRelayBinding which allows event generators and event processors to rendezvous at a well known endpoint in the .Net Service Bus; it provides a non-durable queue so that some time can elapse between sending of the event and receiving it. But the real power of this binding is the embedded logic to setup the communication. It can setup a connection between publishers and subscribers that are sitting behind a corporate firewall or even a NAT. To me this binding seems to be the primary choice to connect services that are not hosted in the cloud but on premise.

A second option that you can use is the queue storage facility provided by the Azure platform itself. Queue storage provides a durable queuing mechanism and is primarily targeted to communication between different components in the cloud.

The third possibility and in my opinion the real powerhouse is the .Net Service Bus, which provides queues as well. These queues also provide a durable queuing mechanism, but in contrast to queue storage, these queues can be configured more dynamically. In fact these queues are exposed in the service bus as just another address in the projects namespace and gain their functionality because they are associated with a queue policy. These policies can be used to determine the longevity of the queue, the capacity and size of the messages for example. So functionally these queues are similar to queue storage, but they are far more flexible in usage.

Event Processing Engine

The third layer, the event processing engine, is the logical location where the event is identified and the appropriate reaction is selected to be executed. Multiple reactions can be chosen by the engine for the same event. For example, the processing engine can decide to forward the event to other event processors. On the other hand it could choose to invoke a downstream activity like invoking a part of a business process, such as reserving some seats in the VIP area of a party, or send an email to the original requestor to tell him or her that everything went horribly wrong and he or she will get their money back.

The primary tool for processing events on the Azure platform is the worker role; it continuously polls a queue for new events and responds to them by executing custom code. Every queuing mechanism discussed in the previous paragraph supports the queue-peek-lock mechanism to ensure that when a message is read from the queue, it does not really get removed from the queue. Until the processor has finished his work and explicitly removes it from the queue, the message will only go invisible for some time. If the processor fails to remove the message within an allotted time frame it will reappear as this indicates that the processor has failed to handle the message. This mechanism provides some sort of retry mechanism to make the processing a bit more resilient to the challenges that the cloud presents us.

The .Net Service Bus provides another feature that is vital for the implementation of an event process layer, routers. A Router is a publish/subscribe message distribution primitive that allows to push events to subscribers. Routers can be configured through policies, just like queues, to distribute the events they receive to all or one of the subscribers, meaning they can be used to forward messages from one or multiple publishers to one or multiple subscribers.

The real beauty however is the fact that queues and routers can subscribe and read from each other, allowing the composition of these primitives in a customized event processing engine, as shown in the example below. They are especially powerful when combined together with worker roles that take care of any complex routing, filtering logic or correlation of events.

Event Processing Engine

Downstream event activity

The last layer consists of the downstream event activity. This represents the consequences of the event. A single event, or event correlation, may initiate numerous downstream activities. The invocation of the activity might be a push by the event processing engine, for example a service invocation, business process initiation, or notification, or it might be a pull by subscribers of event publications such as an RSS feed.

Subscribers might be humans, applications, active business processes, data warehouses, performance dashboards, or even automated agents.

Event processing styles

Queues, routers and worker roles can obviously be combined in an almost infinite number of combinations to support your specific business needs. This is both a good thing, as the engine can easily be adapted to business needs, but also a curse from time to time as complexity might get out of hand. At least this is the case if you don't apply the appropriate style to the problem at hand.

There are three general styles of event processing that you can choose from to implement a specific business requirement: simple, stream, and complex. But you're not limited to stick to any of these, very often the three styles are used together in a mature event driven architecture. Just use them appropriately.

Simple event processing

Simple event processing concerns events that are directly related to specific, measurable changes of condition. In simple event processing, a notable event happens which directly initiates downstream activities. Simple event processing is commonly used to drive the real-time flow of work, thereby reducing lag time. We have already discussed an example of a simple event earlier in this article, when a new gig is announced then the read-only services should update their cached information.

Simple event processing

Stream Event Processing

In Event Stream Processing (ESP), both ordinary and notable events happen. Ordinary events are screened for notability and streamed to information subscribers. Stream event processing is commonly used to drive the real-time flow of information in and around the enterprise, which enables in-time decision making. For example, while community members are reserving tickets for the next big party, the event processing engine is looking for reservations of VIP tickets in order to prioritize their processing or invoke additional business processes like ordering a limousine.

Stream Event Processing

Complex event processing

Complex event processing (CEP) allows patterns of simple and ordinary events to be considered to infer that a complex event has occurred. Complex event processing evaluates a confluence of events and then takes action. The events (notable or ordinary) may cross event types and occur over a long period of time. The event correlation may be causal, temporal, or spatial. CEP is commonly used to detect and respond to business anomalies, threats, and opportunities.

For example, ticket requests can be correlated with member activity on the forums in order figure out which members contribute the most to the promotion of a gig. This information could then be used to elect MVP members and reward them with some nice merchandising.

Complex event processing

CEP requires the employment of sophisticated event interpreters, event pattern definition, matching, and correlation techniques. You need to store all interesting messages, which will probably result into millions and millions of messages if not kept in check with some smart cleanup or archival strategies. You need to compare the meta-data of each incoming message, timestamps for example, with those stored locally in order to detect a relationship and lots more. So you can imagine that it is really hard work to implement such a processor manually. Currently I'm not aware of any such engine for the Azure platform, but I guess some of the vendors will step up soon enough to fill that space. But . maybe you have some spare time left .

Conclusion

The cloud offers great new possibilities for architects, but also imposes some constraints.

In this article I have described how an Event Driven Architecture offers several ways to cope with these constraints. By inverting the communication between caller and callee, making heavy use of asynchronous communication patterns, durable queues and read-only data caches, it allows you to cope with temporal unavailability of parts of the system. Furthermore by embedding all data that describes a business event in the event body allows any subscriber to react based on the latest information, and thus it has no problems with any potential data replication latency. Finally the lack of distributed transactions can be overcome by relying on tentative operations and apology based computing.

I have also shown you the building blocks that the Azure platform offers to implement each layer of EDA based solutions and finally I have described 3 event processing styles to help you find the architecture that is best suited to your particular business problem.

About The Author

Yves Goeleven works as a solution architect for Capgemini Belgium where he leads a special interest group on architecting and designing solutions for the Azure platform. Yves has always been focused on gathering and promoting good architectural and design guidance to be used together with Microsoft technologies. If you would like to learn more about any of this, don't hesitate to visit his blog.

Resources


spacer stripe Issue

It's all about shortcuts on desktops...

“Why have infrastructure architects failed to deliver the promise of a low TCO desktop platform? Avanade describes three practical solutions that help to establish a clear and simple application provisioning strategy.” Read More

How do you fit document management in your process-based architecture?

“Many organizations have implemented a Document Management Systems but fail to integrate document flows in the rest of the ICT Architecture in particular LOB-apps. Spikes provides a strategy to achieve just that!” Read More

Get rich and famous thanks to the Azure Services Platform

"New CloudApp()" is a Microsoft developer challenge (Contest Site) that promotes the new opportunities and innovative ideas developers (.NET & PHP ) are creating with cloud computing on the Azure Services Platform. Read More
footer
spacer © 2009 Microsoft Corporation - Terms of Use | Trademarks | Privacy Statement
spacer rounder