When it comes to migrating large, mission-critical systems to the cloud, the term “lift-and-shift” can sound deceptively simple. While the tools for moving virtual machines have matured, the underlying physics of a distributed cloud environment introduces complexities that can derail even the most carefully planned projects. To shed light on these hidden challenges, we spoke with Vijay Raina, an expert in enterprise SaaS technology and cloud architecture. He shared critical insights from a recent large-scale Azure migration, focusing on how seemingly standard best practices can lead to catastrophic performance failures and how to architect solutions that respect the physical realities of the cloud.
When migrating legacy batch jobs, teams often split servers across Availability Zones for high availability. How can this common best practice unexpectedly introduce critical performance bottlenecks, and what architectural adjustments are necessary to resolve them?
It’s a classic case of a best practice for modern applications creating a nightmare for legacy systems. In an on-premises data center, your application and database servers might be in the same rack, with latency so low it’s almost zero. When you migrate to the cloud, the instinct is to follow the golden rule of availability and place those servers in different physical data centers, or Availability Zones. The problem is, you’ve just introduced physical distance. Light has a speed limit, and that translates to network latency. We saw this firsthand: a batch job running a million sequential SQL statements went from roughly 27 seconds of network overhead inside one zone to nearly eight minutes when split across zones. That’s a staggering 17x increase in pure network wait time, and it was enough to make the process miss its critical SLA window.
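The arithmetic behind that jump is worth making concrete: with a sequential workload, the per-statement latency multiplies directly into runtime. A minimal back-of-the-envelope sketch using the figures Raina quotes, under the simplifying assumption of one network round trip per SQL statement:

```python
# Cumulative network wait for a chatty, sequential batch job.
SQL_STATEMENTS = 1_000_000      # sequential statements issued by the batch job
IN_ZONE_RTT_US = 27             # round-trip time within one Availability Zone, microseconds
CROSS_ZONE_RTT_US = 470         # round-trip time across Availability Zones, microseconds

def network_wait_seconds(statements: int, rtt_us: float) -> float:
    """Total time spent waiting on the network, assuming one round trip per statement."""
    return statements * rtt_us / 1_000_000

in_zone = network_wait_seconds(SQL_STATEMENTS, IN_ZONE_RTT_US)        # ~27 s
cross_zone = network_wait_seconds(SQL_STATEMENTS, CROSS_ZONE_RTT_US)  # ~470 s

print(f"in-zone network wait:    {in_zone:.0f} s")
print(f"cross-zone network wait: {cross_zone / 60:.1f} min")    # ~7.8 minutes
print(f"slowdown factor:         {cross_zone / in_zone:.1f}x")  # ~17x
```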
Imagine you’re leading a migration where a crucial batch process is failing its SLA, yet all the standard monitoring dashboards for CPU and I/O look perfectly healthy. Walk us through how you would diagnose this kind of invisible performance issue and when you’d decide to implement a feature like a Proximity Placement Group.
That scenario is precisely where experience trumps standard monitoring. When CPU, memory, and disk I/O are all in the green but performance is tanking, my mind immediately goes to the network, specifically latency. The first step is to validate the architecture. Are the communicating components in the same virtual network? Yes. Are they in the same region? Yes. Are they in the same Availability Zone? If the answer is no, that’s our prime suspect. The next step is to quantify it. We measure the round-trip time between the servers. Seeing a consistent latency of around 470 microseconds instead of the expected 27 microseconds confirms the cross-zone communication is the culprit. At that point, the solution becomes clear. We have to make a conscious trade-off. For a time-sensitive batch process, meeting the SLA is more critical than surviving a rare zonal outage. We would then implement a Proximity Placement Group, which is essentially a command to Azure to place these specific VMs as physically close as possible, usually in the same data center hall. It’s a deliberate decision to prioritize performance over high availability for that specific workload.
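Quantifying the round trip is usually done with a purpose-built tool such as ping, PsPing, or sockperf, but the idea is easy to sketch. The snippet below is a rough illustration rather than the exact method described above: it times TCP connection setup as a proxy for round-trip latency, and the target address and port are placeholders.

```python
import socket
import statistics
import time

def tcp_connect_rtt_ms(host: str, port: int, samples: int = 50) -> list[float]:
    """Rough RTT estimate: time the TCP handshake to a listening port, in milliseconds."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        rtts.append((time.perf_counter() - start) * 1000)
    return rtts

# Placeholder target: the database server's private IP and its listening port.
rtts = sorted(tcp_connect_rtt_ms("10.0.2.4", 1433))
print(f"median RTT: {statistics.median(rtts):.3f} ms")
print(f"p95 RTT:    {rtts[int(len(rtts) * 0.95)]:.3f} ms")
```

At these scales the handshake and socket overhead matter, so treat the output as an order-of-magnitude check, on the order of tens of microseconds in-zone versus hundreds cross-zone, rather than a precise measurement.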
Let’s shift to another common problem. Engineers often size their cloud gateways based on bandwidth, but then users start reporting random connection drops even when utilization is well below 30%. What invisible limit are they likely hitting, and how have modern SaaS applications made this issue more common?
They’re hitting the flow count ceiling, a hard limit on network appliances that is completely independent of bandwidth. A network “flow” is a unique connection defined by the source and destination IPs, ports, and the protocol. Every single connection your users make, whether to an internal app or a SaaS tool, consumes one of these flows. Modern applications, especially collaboration suites like Office 365 or Teams, are incredibly “chatty.” A single user can easily open 50 to 60 concurrent flows just by opening their browser, syncing a file, and having a chat window active. If you have 10,000 users funneled through a central ExpressRoute gateway that has a limit of 500,000 flows, you can do the math. You’ll hit that limit long before you saturate a 1 Gbps pipe, and when you do, the gateway simply starts dropping packets for new connection requests. It feels random and mysterious to the user, but it’s a very real and predictable infrastructure limit.
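The math Raina refers to takes about thirty seconds and is worth spelling out. A quick sketch using the figures above (the 500,000-flow limit is illustrative; the actual ceiling depends on the gateway SKU):

```python
# Capacity check: estimated concurrent flows vs. the gateway's flow ceiling.
USERS = 10_000
FLOWS_PER_USER = range(50, 61)   # typical concurrent flows per user, per the interview
GATEWAY_FLOW_LIMIT = 500_000     # illustrative limit for the central gateway

low, high = USERS * min(FLOWS_PER_USER), USERS * max(FLOWS_PER_USER)
print(f"estimated concurrent flows: {low:,} to {high:,}")
print(f"gateway flow limit:         {GATEWAY_FLOW_LIMIT:,}")
if high >= GATEWAY_FLOW_LIMIT:
    print("At or over the ceiling: expect dropped connections well before bandwidth saturates.")
```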
To solve that gateway connection limit, an organization could either upgrade the hardware SKU for a higher flow count or re-architect the network for a local breakout. How would you guide a client in choosing between these two paths, considering both cost and long-term effectiveness?
The choice really comes down to whether you want a quick fix or a strategic solution. Upgrading the gateway SKU is the path of least resistance. It’s faster to implement—you’re essentially just paying for a more powerful virtual appliance. However, it’s a tactical move that incurs higher recurring costs and doesn’t address the root cause, which is that you’re backhauling unnecessary traffic through your private network. I would typically recommend this only as a short-term stopgap to restore service immediately. The more robust, long-term solution is to implement a local breakout, also known as split tunneling. This involves reconfiguring your network to route trusted, high-volume SaaS traffic—like Microsoft 365—directly out to the internet from the user’s location, completely bypassing your Azure gateway. It’s more complex to set up initially, but it’s far more scalable and cost-effective, as it reserves your expensive private connection for traffic that actually needs it.
Given how disruptive these flow limits can be when discovered in production, how can an organization get ahead of the problem? Could you describe a practical way to estimate connection needs before the migration even begins?
Absolutely, you should never wait for production to find these limits. The most effective way to get ahead is to model your user behavior before you move. You don’t need to analyze everyone, just a representative sample. The process involves using a packet capture tool, like Wireshark, on a typical user’s machine for a day. You then analyze that capture file to count the number of unique connections, or flows, they generate. You can even write a simple script to parse the capture file and count the unique 5-tuples. Once you have an average number of flows per user—say, 60—you can multiply that by your total user count to get a realistic estimate of the concurrent flows your gateway will need to handle. This data-driven approach transforms sizing from a guess into an informed decision, preventing very costly surprises down the road.
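For readers who want to try this, the counting step is straightforward to script. The sketch below uses Scapy to read a capture file and count distinct 5-tuples; the interview doesn’t prescribe a particular tool, so the library choice and the file name are assumptions (tshark or a dpkt-based parser would work just as well).

```python
# Count distinct flows (5-tuples) in a packet capture from a typical user's machine.
# Assumes Scapy is installed (pip install scapy); the capture file name is a placeholder.
from scapy.all import rdpcap
from scapy.layers.inet import IP, TCP, UDP

def count_unique_flows(pcap_path: str) -> int:
    flows = set()
    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(IP):
            continue
        ip = pkt[IP]
        if pkt.haslayer(TCP):
            l4, proto = pkt[TCP], "tcp"
        elif pkt.haslayer(UDP):
            l4, proto = pkt[UDP], "udp"
        else:
            continue
        # Normalise direction so a request and its reply count as one flow.
        endpoints = tuple(sorted([(ip.src, l4.sport), (ip.dst, l4.dport)]))
        flows.add((endpoints, proto))
    return len(flows)

per_user_flows = count_unique_flows("typical_user_day.pcap")  # placeholder file name
print(f"unique flows for this user: {per_user_flows}")
print(f"rough gateway estimate for 10,000 users: {per_user_flows * 10_000:,}")
```

One practical note: rdpcap loads the whole capture into memory, so for a full day’s trace a streaming approach such as tshark may be the more comfortable route.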
Do you have any advice for our readers?
My main piece of advice is to always respect the physics of the cloud. “Lift-and-shift” sounds like a simple copy-paste operation, but you are moving from a highly controlled, low-latency LAN environment to a distributed, WAN-based one. Assumptions that held true in your data center, like negligible latency between servers or unlimited connections, do not apply. Before you migrate, question everything. If two components are chatty, ask if they can tolerate being in different buildings. If you’re routing all user traffic through a single pipe, calculate the number of connections, not just the bandwidth. Addressing these physical and logical constraints during the design phase is the difference between a smooth migration and a frantic, post-launch firefighting effort.
