Scalability / Resiliency — Practical steps beyond what we learnt in books

4 min read · Apr 16, 2022

I have written a few blog posts on performance, and I have read many excellent articles on resiliency, availability and scalability. I won’t repeat them here, as they are readily available in varying breadth and depth on some great websites. In this article, I will talk about what I learned on the job and share some tips / steps that are extremely important when planning for the scalability and resiliency of an application. Let’s get into it.

A Few Ground Rules for Digital Transformation

1. Do not hurry: a) move small things, b) measure the reliability metrics, c) fix any issues, d) measure again, e) move the next item.
2. What are reliability metrics?
* The application performs its functions as expected [measure the current and the modified (transformed) version for each user request through anomaly metrics]
* Its performance is good enough for the required use case, under the expected load and data volume
* The system prevents any unauthorised access, i.e. your authentication / authorization engine should either be migrated as well or stay on premise, but make sure it is managed correctly
* Last but not least, the application is tolerant of both user faults and system faults

Things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A zero-fault system may not be practically possible, but during design engineers should identify all the possible faults and a self-healing or fallback mechanism for each of them.

3. Design your storage and database. While migrating, the two key parameters are performance and cost, and the balance between them. Check the links below before choosing a solution:

https://aws.amazon.com/startups/start-building/how-to-choose-a-database/

https://aws.amazon.com/products/storage/

4. Choose how you will migrate your application code. Check the link below:

Guide to Application Digital Transformation in Cloud & On-Premise — Part 3 (Technical Solution)

Digital Transformation & Cloud are the buzz words in IT for last 4–5 years and after COVID, the Remote way of work…

tapobrata-chatterjee-blogger.medium.com

The Basic Stuff

Your application should have the following mechanisms in place to support self-healing and scalability:

Retry
Usage of messaging
Auto Scaling or Manual Scaling
Throttling
Degrade gracefully
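
Two items from the list above, retry and graceful degradation, can be sketched together in Python. This is a minimal illustration and not a production library: the function name `call_with_retry`, its parameters, and the backoff values are all assumptions made for the example.

```python
import random
import time


def call_with_retry(operation, fallback, max_attempts=3, base_delay=0.1,
                    sleep=time.sleep):
    """Call `operation`; retry on failure, then degrade to `fallback`.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                break  # retries exhausted, fall through to the fallback
            # Exponential backoff with jitter: the base wait doubles on each
            # attempt, and a random extra spreads retries from many clients.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    # Degrade gracefully: serve a reduced response instead of an error.
    return fallback()
```

A typical use would be wrapping a call to a live recommendation service as `operation` and passing a cached or default list as `fallback`, so the page still renders when the dependency is down.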

Azure platform resiliency

Building a reliable application in the cloud is different from traditional on-premises application development. While…

docs.microsoft.com

Application resiliency patterns

The first line of defense is application resiliency. While you could invest considerable time writing your own…

docs.microsoft.com

How to be ready for Peak Traffic

Calculate the Projection

Part 1 — Calculate the Current TPS / Latency

What is the current maximum daily traffic of your services / applications? Let’s say it’s 100 TPS (transactions per second).
What is the latency SLA? Let’s say 50 ms per transaction.
How many hosts do you need to keep latency at 50 ms or below? Let’s say 5 hosts.
So, what is your max TPS per host? For our example it’s 100 / 5 = 20 TPS per host.

Part 2 — Calculate the scaling factor and required Hosts

What is the possible traffic during peak? Let’s say we expect traffic to go up during Thanksgiving, Christmas or the days before Diwali, and that the max traffic can reach 400 TPS.
The required number of hosts will be 400 / 20 = 20. You already have 5, so you have to arrange 15 additional hosts.
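
The arithmetic of Part 1 and Part 2 can be written down as a small Python sketch. The function names and the optional `headroom` parameter (a safety margin on top of the projected peak) are assumptions added for illustration. The numbers are the ones from the example.

```python
import math


def tps_per_host(current_tps, current_hosts):
    """Part 1: TPS that one host sustains while meeting the latency SLA."""
    return current_tps / current_hosts


def hosts_for_peak(peak_tps, per_host_tps, headroom=0.0):
    """Part 2: hosts needed for the projected peak, rounded up.

    `headroom` is a hypothetical safety margin (0.2 means 20% extra).
    """
    return math.ceil(peak_tps * (1 + headroom) / per_host_tps)


per_host = tps_per_host(current_tps=100, current_hosts=5)    # 20.0
required = hosts_for_peak(peak_tps=400, per_host_tps=per_host)  # 20
additional = required - 5                                     # 15
print(per_host, required, additional)
```

Note that this assumes throughput scales linearly with host count, which only holds if shared dependencies such as the database can also take the extra load.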

What releases did you have in the last 6 months?

Assuming you have calculated the traffic projection and the hosts required to keep latency in check, the next step is a set of important checks. To start with:

Did you add any new features or modify existing ones in a way that may increase the latency of your services? For example, instead of calling one downstream service you may now be calling two, or you added a new API that increases TPS for that service and hence its latency. If latency is affected, you need to adjust the host count to serve the peak traffic.
Did your product get a major facelift, so that traffic that used to go to Service A now goes to Service B? If yes, what mechanism do you use to fall back to Service A if an issue occurs in Service B? [You can use a Circuit Breaker style pattern, or keep Service A and Service B in hot / cold mode with some mechanism to control traffic to A or B.]
Is there any upstream that will increase the load on your service? There should be a mechanism to inform you of any such event, and there has to be an SLA between upstream and downstream services.
Does your service call any downstream services? Have they been notified of any change in traffic pattern, and of the peak traffic they will receive? Any excess latency in a downstream can make your service slow, so be careful about that.
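
The fallback from Service B to Service A mentioned in the checklist can be sketched as a very small circuit breaker. This is a simplified illustration under assumed names and thresholds (`CircuitBreaker`, `failure_threshold`, `reset_timeout`), not the API of any particular resiliency library.

```python
import time


class CircuitBreaker:
    """Route calls to `primary`; switch to `fallback` after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the failing service
            # Timeout elapsed (half-open): allow one trial call below.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open, or re-open after trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With `primary` bound to the new Service B and `fallback` bound to the old Service A, traffic automatically returns to A while B is failing, and B is probed again once `reset_timeout` has passed.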

Review Alarms / Monitoring

Do you have alarms set for when latency, CPU, memory or TPS crosses a maximum limit?
Do you have metrics and graphs in place for automatic / manual monitoring?
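
As a rough illustration of the first question, an alarm is just a named limit checked against the latest metric sample. The sketch below is not tied to any monitoring product. The metric names and the CPU and memory limits are assumptions, while the latency and TPS limits reuse the numbers from the projection example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    metric: str
    max_value: float


def breached(alarms, sample):
    """Return the metrics in `sample` whose value exceeds the alarm limit."""
    return [a.metric for a in alarms
            if a.metric in sample and sample[a.metric] > a.max_value]


ALARMS = [
    Alarm("latency_ms", 50),       # latency SLA from the example
    Alarm("cpu_percent", 80),      # assumed limit
    Alarm("memory_percent", 85),   # assumed limit
    Alarm("tps_per_host", 20),     # max TPS per host from the example
]

sample = {"latency_ms": 62, "cpu_percent": 71,
          "memory_percent": 90, "tps_per_host": 18}
print(breached(ALARMS, sample))  # ['latency_ms', 'memory_percent']
```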

Are you being Over Aggressive or too lenient?

While availability, self-healing and performance are key to ensuring no or minimal customer impact, the cost of infrastructure is equally important.
Do you have measures in place for a) budget vs actual spending, b) over-scaling, i.e. CPU / memory utilisation that is too low, and c) un-utilised hosts sitting in your pool for a long period of time?
There should be a utilisation target for every service / team to ensure governance over utilisation.
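
The over-scaling and budget checks above can be sketched as two small helpers. The 30% CPU target, the host names and the sample figures are made-up values for illustration only. Each team would set its own utilisation target.

```python
def over_scaled_hosts(cpu_by_host, target_low=30.0):
    """Return hosts whose average CPU utilisation (%) is below `target_low`."""
    return sorted(
        host for host, samples in cpu_by_host.items()
        if samples and sum(samples) / len(samples) < target_low
    )


def budget_variance(budget, actual):
    """Positive result means spending is over budget."""
    return actual - budget


cpu_by_host = {
    "host-1": [55, 60, 58],
    "host-2": [8, 12, 10],
    "host-3": [5, 4, 6],
}
print(over_scaled_hosts(cpu_by_host))   # ['host-2', 'host-3']
print(budget_variance(10000, 12500))    # 2500
```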

Final Steps

Nothing should be blocking your service pipeline, so that you can deploy any emergency fix during peak as needed.
War room: set up a war room during peak hours or peak days, with all key people in the room supporting the “D” day.