I see this kind of thinking all the time in hardware engineering as well, and it all boils down to premature optimization. Cost is almost always the driver.
One example: on a recent project, a very cost-sensitive machine reused a small heater copied over from another product, but no one actually verified that it was good to the required limits (just the default use case). Well, it turns out it wasn't quite powerful enough, and by the end of the project it was way too late and expensive to fix! On top of that, all the engineering time spent figuring this out was wasted (though it often seems management doesn't count engineering time the same way it counts parts cost)!
I've since learned that at the beginning of a project it's critical to identify the riskiest parts of the design, isolate them into a module, and over-spec that module, hopefully with a path to reduced cost later on. But the most important thing I've learned is: don't try to solve tomorrow's problems today!
When you account for employment taxes on every employee (around 30-35% on top of base salary), engineers aren't exactly cheap. But the real difference is that whether the engineer is looking at the heating coil or not, the company is still paying them.
> The difference really is that whether the engineer is looking at the heating coil or not, the company is still paying them.
And while they're looking at that heating coil, they're not doing something that could be generating more value for the company. But opportunity cost is extremely hard to measure.
It certainly depends a lot on your quantities... if you are only making 100-200 of something a year then I think engineering time would dominate the calculation. But as with all things, it depends and you need to do the math for your own situation!
I've spent quite a bit of time on a problem very similar to this. It's surprisingly challenging. Imagine this scenario:
Some service has three units of capacity available (e.g. VMs). This is the minimum amount allowed, on the theory that things won't break too badly if one of them happens to crash. You target 66% CPU utilization. Suddenly, one goes down, and the software sees 100% CPU utilization on the other two. What should the software do?
Well, the obvious thing is to add one more instance, assuming that one of them crashed and its load shifted to the other two. However, what if the thing that actually happened is that the demand doubled, and the load caused the crash? Then, you should probably add six more instances (assuming that the two remaining live ones are going to go down while those six are coming up).
If you look at only CPU utilization, it's impossible to tell the difference between these two situations.
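The ambiguity can be made concrete with some hypothetical numbers (the capacity figures below are illustrative assumptions, not from any real system):

```python
# Hypothetical numbers illustrating the ambiguity described above.

def observed_cpu(total_load, live_instances, capacity_per_instance=1.0):
    """Average CPU utilization reported by the live instances (capped at 100%)."""
    return min(1.0, total_load / (live_instances * capacity_per_instance))

# Scenario A: one of three instances crashed; demand is unchanged.
# Three instances at 66% means a total load of ~2.0 instance-capacities.
cpu_a = observed_cpu(total_load=2.0, live_instances=2)

# Scenario B: demand doubled (4.0 instance-capacities), and the
# overload crashed one instance.
cpu_b = observed_cpu(total_load=4.0, live_instances=2)

print(cpu_a, cpu_b)  # both saturate at 1.0 -- indistinguishable from CPU alone
```

Once utilization pegs at 100%, the metric carries no information about how far past capacity you actually are, which is exactly why the two scenarios look identical.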
Which is why proper monitoring and understanding of the system as a whole is imperative. Utilization doesn't come from nowhere, be it CPU, memory, or anything else. If you understand that requests per second x generates CPU usage y then you can monitor at the edge and scale according to actual need.
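A minimal sketch of that idea, assuming you've measured (not guessed) how much traffic one instance handles at your CPU target; the calibration constant here is a made-up example value:

```python
import math

# Hypothetical calibration: each instance handles ~500 req/s at the
# 66% CPU target discussed upthread. This number must come from
# load testing your own service.
RPS_PER_INSTANCE_AT_TARGET = 500.0

def desired_instances(edge_rps, minimum=3):
    """Scale from demand measured at the edge, not from CPU on the survivors."""
    return max(minimum, math.ceil(edge_rps / RPS_PER_INSTANCE_AT_TARGET))

print(desired_instances(1000))  # 3: the minimum floor applies
print(desired_instances(3000))  # 6: doubled demand is visible at the edge
```

Because the edge request rate is unaffected by instance crashes, it disambiguates the two scenarios above: a crash leaves it flat, a demand spike doubles it.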
This is even scarier in the physical world. Just-in-time logistics means companies aren't warehousing inventories as large as they used to. In the case of major events (natural disasters, terrorist attacks, etc.), there isn't enough reserve supply to go around.
This has been a major generational cultural shift. I learned from old-timers who grew up in the Depression and have a 3-month supply of food and batteries and a water filter in their garage, a tow strap and toolkit in their car, a shotgun in their safe, and money in seven different bank accounts as well as at least 3 cash stash spots and some gold hidden away somewhere. They grew up in a time when almost nothing had the reliability that (we think) everything does today. A less safe time all around.
The reality is, we now have a much more interconnected web of dependencies with little capacity to absorb disruptions. We'll almost certainly see much more significant consequences when those now-low-probability events finally do occur.
Doesn’t need to be a natural disaster. Not too long ago, a whole quarter of Berlin (Köpenick) was entirely without electricity for about two days because somebody drilled through both the primary and the backup 110 kV cables. No water is the primary issue, but on top of that, without electricity most people can’t cook their food stores, and frozen/chilled goods spoil. No shop can sell anything, either: cards and cash registers don’t work.
It’s important that systems have some design margin (buffers of one kind or another) so that a disruption / transient event in one part of the system is absorbed locally and not passed on to the rest of the system.
It seems like this problem is solved by simply setting a sensible minimum in an autoscaling group. And not "everyone on Earth was abducted by aliens and stopped using the service" levels of minimum.
Say I'm an e-commerce site and on Black Friday I can see historically (or just make an educated guess if it's your first holiday sale) I get "n" requests per second to my service.
I'll set my autoscaling group the day before to be able to handle that "n" number of requests, with the ability to grow if my expectations are exceeded. If my expectations are not met, then my autoscaling group won't shrink. Then the day after the holiday sale, I can configure my autoscaling group to have a different minimum.
This solves the problem of balancing between capacity planning and saving money by not having idle resources running.
If you're the type of person who hates human intervention for running your operation, then fine. Put in a scheduled config change every year before a sale to change your autoscaling group size.
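The floor itself is easy to derive from last year's numbers. A sketch, with made-up traffic and capacity figures (AWS also supports scheduled scaling actions natively, so the config change can be automated rather than done by hand):

```python
import math

def sale_day_minimum(historical_peak_rps, rps_per_instance, headroom=1.5):
    """Pick an ASG minimum that covers last year's observed peak with some
    headroom, instead of letting the group scale down to a quiet-day floor."""
    return math.ceil(historical_peak_rps * headroom / rps_per_instance)

# If last Black Friday peaked at 12,000 req/s and one instance
# comfortably serves 500 req/s (both hypothetical numbers):
print(sale_day_minimum(12_000, 500))  # 36 instances as the sale-day floor
```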
It's pretty rare to have enormous spikes in application usage without good reason. Such as video-game releases, holiday sales, startup openings, viral social media campaigns.
> It seems like this problem is solved by simply setting a sensible minimum in an autoscaling group.
Do you really think people do things because it makes sense to do them for their particular situation or because those things are "the thing to do(tm)"?
Most people go to see Mona Lisa because that's what people do when in Paris, not because they care about that particular piece of art.
Same with automation. It really makes me sad when I see people "automating" things they barely understand how to manually do, let alone the "when" to do it.
Yes, your example is perfectly valid, but that means one understands the system they are working with and generally people have no bloody clue about what they are doing.
I recently gave a talk at SRECon [1] about a partial solution: Using a PID controller. It won't solve all instances of this problem, but properly tuned, it will dampen the effect of these sudden events and quicken the response times to them.
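For readers unfamiliar with the technique, here's a minimal, untuned sketch of a PID loop applied to scaling. The gains and setpoint are placeholder values; real ones have to be tuned against your own workload, which is most of the work the talk is about:

```python
class PIDScaler:
    """Textbook PID loop nudging instance count toward a utilization target.
    Gains here are placeholders, not recommendations."""

    def __init__(self, kp=8.0, ki=1.0, kd=2.0, setpoint=0.66):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, utilization, dt=1.0):
        """Return the suggested change in instance count for this tick."""
        error = utilization - self.setpoint  # positive = overloaded
        self.integral += error * dt          # I term: accumulated pressure
        derivative = (error - self.prev_error) / dt  # D term: rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PIDScaler()
delta = pid.update(utilization=1.0)  # sudden saturation after a crash
print(delta)  # positive: add capacity, with the D term boosting the response
```

The derivative term is what quickens the response to sudden events, and the integral term keeps pushing if utilization stays pinned, which is why a tuned loop dampens exactly the scenario described upthread.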
> Of course, at some point, [...] the local service gets restarted by the ops team (because it can't self-heal, naturally)
Maybe off-topic, but what are some good strategies for the kind of "self-healing" being talked about here? If a service needs to be restarted, how could you automate the detection and restart process?
In the simplest case, the service could shut itself down, and the supervising daemon / scheduler would restart it.
Supervisors like systemd also have a watchdog that will force-restart a service that hasn't checked in for some time.
For a service that manages its own network connection, implementing auto-reconnect can be a form of self-healing (and surprisingly hard to get right in all edge cases).
The key is, as Rachel wrote in the OP, to get a good signal. You need to be able to distinguish a working from a non-working service to implement reliable self-healing.
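For the systemd case mentioned above, the shape of the unit file looks roughly like this (the binary path is hypothetical, and the service itself must periodically send `WATCHDOG=1` via `sd_notify` for the watchdog to work):

```ini
[Service]
# Hypothetical service binary; must ping sd_notify(WATCHDOG=1)
# more often than WatchdogSec, or systemd force-restarts it.
ExecStart=/usr/local/bin/myservice
# Restart on crash, non-zero exit, or watchdog timeout.
Restart=on-failure
# The watchdog interval: no check-in for 30s counts as failure.
WatchdogSec=30
```

Note that this is only as good as the check-in signal: if the service pings the watchdog from a thread that stays healthy while the real work is wedged, you're back to the "good signal" problem.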
There's something related called the bullwhip effect. I think that shedding requests under load, rather than putting them in some overflow queue, prevents it: each service scales only to its actual incoming traffic, so the effect isn't amplified down the chain of services as each one scales up.
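A minimal sketch of that shedding behavior, using a bounded queue that rejects new work when full instead of letting a backlog build and propagate downstream:

```python
import queue

# Bounded work queue: admission is capped, so downstream services
# never see more than the system's actual serving capacity.
work = queue.Queue(maxsize=100)

def accept(request):
    """Returns True if the request was admitted, False if shed."""
    try:
        work.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller should surface a 503 / retry-after upstream

for i in range(150):
    accept(i)

print(work.qsize())  # capped at 100; the extra 50 were shed, not queued
```

The shed requests fail fast at the edge, so each downstream tier scales to real demand rather than to an inflated backlog, which is the bullwhip amplification this prevents.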
Dynamically scaling down based on CPU consumption is the wrong way to do it, IMO. If your site is decently sized, you have a pretty typical diurnal pattern with weekly cyclical variation; that's your baseline.
> For another thing, how about knowing approximately what the traffic is supposed to look like for that time of day, on that day of the week and/or year? Then, don't down-scale past that by default?
Ran a few ad pixel servers. CPU was fine for that. Diurnal cycle, weekly cycle, handled holidays, handled sudden spikes. Pretty trivial ASG on AWS globally. 2 million rps.
But if your service was down for longer than it takes to downscale to the minimum, scaling back up is not that big of an issue: it was down anyway. Also, 24/7 instances exist for a reason; autoscaling is there to handle spikes, not normal traffic.
Pay attention, and don't confuse intention with effect.
What she's saying is that if you configure scaling such that it'll scale down when demand is unusually low, and then demand returns, the spike may be a difficult one to handle, particularly if your services depend on each other but each scales only based on its own history.
If A needs B needs C, and demand suddenly returns to A, does that cause C to scale up? Or will A scale up first, while C stays low for another half-hour because it recently scaled down?
Having C stay under-provisioned for half an hour after an outage ends wasn't anyone's intention when the autoscaling was configured. But as I wrote, don't confuse intention with effect.
That just means you should scale based on the work to be done rather than poor proxies such as CPU utilization. Also set a reasonable minimum and maximum based on observed load in production and review this as part of regular operational reviews.
Good edge case to consider when designing an autoscaling service, but now that I'm aware of it, I think I can design around the problem with some combination of the suggested solutions and still get the autoscaling that the article seemed to be warning me away from...
I don’t think the article was saying not to auto-scale as much as realizing it’s less of an edge case than it might sound and being careful not to underestimate the level of effort or overestimate the savings. That rang true to me — I’ve seen a lot of people realize the staff time they spent ended up pushing the time to recoup many years into the future. This is especially common if they’re inspired by a big tech company’s cool blog post or talk describing something amortized across a much larger volume.
If scaling up is painful there is something wrong with the architecture. Aside from this scenario, what if you just get a spike in traffic? If your scaling solution can't handle it, get a better one, otherwise what's the point?
Presumably servers take time to boot and initialize, which is still a problem if you get a spike but those aren't as sudden as "everything just turned back on".
Yup, in a reasonably (but not entirely) optimised setup, the spin-up time for a new node, from the scale-up event launching it to it being able to serve traffic, may be 2-3 minutes. And trust me, after a couple of mishaps with very aggressive scale-ups, you won't let your system launch the full demand-absorbing capacity all at once.
In a fully optimised setup each service image is itself 100% preconfigured, and only provisions node secrets during the boot. Even one of these types of nodes takes easily 30-40 seconds from launch event to actually serve traffic: it may join the load balancer just 25 seconds in, but the load balancer will want to see at least two good health checks before allowing any traffic to it.
The problem with aggressive upscaling in the depicted scenario is that your plumbing layer is also likely scaled down. Hitting it with a cascade of new nodes has the risk of going all thundering herd, crippling the system for both existing and new nodes.