I see this kind of thinking all the time in hardware engineering as well, and it all boils down to premature optimization. Cost is almost always the driver.
One example: on a recent project, a very cost-sensitive machine reused a small heater copied over from another product, but no one actually verified that it was good to the required limits (just the default use case). Well, it turns out it wasn't quite powerful enough, and by the end of the project it was way too late and expensive to fix! On top of that, all the engineering time spent figuring this out was wasted (though it often seems management doesn't count engineering time the same way it counts parts cost)!
I've since learned that at the beginning of a project it's critical to identify the riskiest parts of the design, isolate them into a module, and over-spec that module, hopefully with a path to reduced cost later on. But the most important thing I've learned is: don't try to solve tomorrow's problems today!
When you account for employment taxes on every employee (around 30-35% on top of base salary), engineers aren't exactly cheap. But the real difference is that whether the engineer is looking at the heating coil or not, the company is still paying them.
> The difference really is that whether the engineer is looking at the heating coil or not, the company is still paying them.
And while they're looking at that heating coil, they're not doing something that could be generating more value for the company. But opportunity cost is extremely hard to measure.
It certainly depends a lot on your quantities... if you are only making 100-200 of something a year then I think engineering time would dominate the calculation. But as with all things, it depends and you need to do the math for your own situation!
I've spent quite a bit of time on a problem very similar to this. It's surprisingly challenging. Imagine this scenario:
Some service has three units of capacity available (e.g. VMs). This is the minimum amount allowed, on the theory that things won't break too badly if one of them happens to crash. You target 66% CPU utilization. Suddenly, one goes down, and the software sees 100% CPU utilization on the other two. What should the software do?
Well, the obvious thing is to add one more instance, assuming that one of them crashed and its load shifted to the other two. However, what if the thing that actually happened is that the demand doubled, and the load caused the crash? Then, you should probably add six more instances (assuming that the two remaining live ones are going to go down while those six are coming up).
If you look at only CPU utilization, it's impossible to tell the difference between these two situations.
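The ambiguity can be made concrete with some hypothetical numbers (the capacity figures below are illustrative assumptions, not from any real system):

```python
# Hypothetical numbers illustrating the ambiguity described above.

def observed_cpu(total_load, live_instances, capacity_per_instance=1.0):
    """Average CPU utilization reported by the live instances (capped at 100%)."""
    return min(1.0, total_load / (live_instances * capacity_per_instance))

# Scenario A: one of three instances crashed; demand is unchanged.
# Three instances at 66% means a total load of ~2.0 instance-capacities.
cpu_a = observed_cpu(total_load=2.0, live_instances=2)

# Scenario B: demand doubled (4.0 instance-capacities), and the
# overload crashed one instance.
cpu_b = observed_cpu(total_load=4.0, live_instances=2)

print(cpu_a, cpu_b)  # both saturate at 1.0 -- indistinguishable from CPU alone
```

Once utilization pegs at 100%, the metric carries no information about how far past capacity you actually are, which is exactly why the two scenarios look identical.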
Which is why proper monitoring and understanding of the system as a whole is imperative. Utilization doesn't come from nowhere, be it CPU, memory, or anything else. If you understand that requests per second x generates CPU usage y then you can monitor at the edge and scale according to actual need.
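A minimal sketch of that idea, assuming you've measured (not guessed) how much traffic one instance handles at your CPU target; the calibration constant here is a made-up example value:

```python
import math

# Hypothetical calibration: each instance handles ~500 req/s at the
# 66% CPU target discussed upthread. This number must come from
# load testing your own service.
RPS_PER_INSTANCE_AT_TARGET = 500.0

def desired_instances(edge_rps, minimum=3):
    """Scale from demand measured at the edge, not from CPU on the survivors."""
    return max(minimum, math.ceil(edge_rps / RPS_PER_INSTANCE_AT_TARGET))

print(desired_instances(1000))  # 3: the minimum floor applies
print(desired_instances(3000))  # 6: doubled demand is visible at the edge
```

Because the edge request rate is unaffected by instance crashes, it disambiguates the two scenarios above: a crash leaves it flat, a demand spike doubles it.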
This is even scarier in the physical world. Just-in-time logistics means companies aren't warehousing inventories as large as they used to. In the case of major events (natural disasters, terrorist attacks, etc.), there isn't enough reserve supply to go around.
This has been a major generational cultural shift. I learned from old-timers who grew up in the Depression and have a 3-month supply of food and batteries and a water filter in their garage, a tow strap and toolkit in their car, a shotgun in their safe, and money in seven different bank accounts as well as at least 3 cash stash spots and some gold hidden away somewhere. They grew up in a time when almost nothing had the reliability that (we think) everything does today. A less safe time all around.
The reality is, we now have a much more interconnected web of dependencies with little capacity to absorb disruptions. We'll almost certainly see much more significant consequences when those now-low-probability events finally do occur.
Doesn’t need to be a natural disaster. Not too long ago, a whole quarter of Berlin (Köpenick) was entirely without electricity for about two days because somebody drilled through both the primary and the backup 110 kV cables. No water is the primary issue, but on top of that, without electricity most people can’t cook their food stores, and frozen/chilled goods spoil. No shop can sell anything, either: cards and cash registers don’t work.
It’s important that systems have some design margin (buffers of one kind or another) so that a disruption / transient event in one part of the system is absorbed locally and not passed on to the rest of the system.
It seems like this problem is solved by simply setting a sensible minimum in an autoscaling group. And not "everyone on Earth was abducted by aliens and stopped using the service" levels of minimum.
Say I'm an e-commerce site and on Black Friday I can see historically (or just make an educated guess if it's your first holiday sale) I get "n" requests per second to my service.
I'll set my autoscaling group the day before to be able to handle that "n" number of requests, with the ability to grow if my expectations are exceeded. If my expectations are not met, then my autoscaling group won't shrink. Then the day after the holiday sale, I can configure my autoscaling group to have a different minimum.
This solves the problem of balancing between capacity planning and saving money by not having idle resources running.
If you're the type of person who hates human intervention for running your operation, then fine. Put in a scheduled config change every year before a sale to change your autoscaling group size.
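The floor itself is easy to derive from last year's numbers. A sketch, with made-up traffic and capacity figures (AWS also supports scheduled scaling actions natively, so the config change can be automated rather than done by hand):

```python
import math

def sale_day_minimum(historical_peak_rps, rps_per_instance, headroom=1.5):
    """Pick an ASG minimum that covers last year's observed peak with some
    headroom, instead of letting the group scale down to a quiet-day floor."""
    return math.ceil(historical_peak_rps * headroom / rps_per_instance)

# If last Black Friday peaked at 12,000 req/s and one instance
# comfortably serves 500 req/s (both hypothetical numbers):
print(sale_day_minimum(12_000, 500))  # 36 instances as the sale-day floor
```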
It's pretty rare to have enormous spikes in application usage without good reason. Such as video-game releases, holiday sales, startup openings, viral social media campaigns.
> It seems like this problem is solved by simply setting a sensible minimum in an autoscaling group.
Do you really think people do things because it makes sense to do them for their particular situation or because those things are "the thing to do(tm)"?
Most people go to see Mona Lisa because that's what people do when in Paris, not because they care about that particular piece of art.
Same with automation. It really makes me sad when I see people "automating" things they barely understand how to manually do, let alone the "when" to do it.
Yes, your example is perfectly valid, but that means one understands the system they are working with and generally people have no bloody clue about what they are doing.
I recently gave a talk at SRECon [1] about a partial solution: Using a PID controller. It won't solve all instances of this problem, but properly tuned, it will dampen the effect of these sudden events and quicken the response times to them.
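For readers unfamiliar with the technique, here's a minimal, untuned sketch of a PID loop applied to scaling. The gains and setpoint are placeholder values; real ones have to be tuned against your own workload, which is most of the work the talk is about:

```python
class PIDScaler:
    """Textbook PID loop nudging instance count toward a utilization target.
    Gains here are placeholders, not recommendations."""

    def __init__(self, kp=8.0, ki=1.0, kd=2.0, setpoint=0.66):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, utilization, dt=1.0):
        """Return the suggested change in instance count for this tick."""
        error = utilization - self.setpoint  # positive = overloaded
        self.integral += error * dt          # I term: accumulated pressure
        derivative = (error - self.prev_error) / dt  # D term: rate of change
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PIDScaler()
delta = pid.update(utilization=1.0)  # sudden saturation after a crash
print(delta)  # positive: add capacity, with the D term boosting the response
```

The derivative term is what quickens the response to sudden events, and the integral term keeps pushing if utilization stays pinned, which is why a tuned loop dampens exactly the scenario described upthread.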
> Of course, at some point, [...] the local service gets restarted by the ops team (because it can't self-heal, naturally)
Maybe off-topic, but what are some good strategies for the kind of "self-healing" being talked about here? If a service needs to be restarted, how could you automate the detection and restart process?
In the simplest case, the service could shut itself down, and the supervising daemon / scheduler would restart it.
Supervisors like systemd also have a watchdog that will force-restart a service that hasn't checked in for some time.
For a service that manages its own network connection, implementing auto-reconnect can be a form of self-healing (and surprisingly hard to get right in all edge cases).
The key is, as Rachel wrote in the OP, to get a good signal. You need to be able to distinguish a working from a non-working service to implement reliable self-healing.
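For the systemd case mentioned above, the shape of the unit file looks roughly like this (the binary path is hypothetical, and the service itself must periodically send `WATCHDOG=1` via `sd_notify` for the watchdog to work):

```ini
[Service]
# Hypothetical service binary; must ping sd_notify(WATCHDOG=1)
# more often than WatchdogSec, or systemd force-restarts it.
ExecStart=/usr/local/bin/myservice
# Restart on crash, non-zero exit, or watchdog timeout.
Restart=on-failure
# The watchdog interval: no check-in for 30s counts as failure.
WatchdogSec=30
```

Note that this is only as good as the check-in signal: if the service pings the watchdog from a thread that stays healthy while the real work is wedged, you're back to the "good signal" problem.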
There's something related called the bullwhip effect. I think that shedding requests under load, rather than putting them in some overflow queue, prevents it: each service scales only to its actual incoming traffic, so the effect isn't amplified down the chain of services as each one scales up.
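A minimal sketch of that shedding behavior, using a bounded queue that rejects new work when full instead of letting a backlog build and propagate downstream:

```python
import queue

# Bounded work queue: admission is capped, so downstream services
# never see more than the system's actual serving capacity.
work = queue.Queue(maxsize=100)

def accept(request):
    """Returns True if the request was admitted, False if shed."""
    try:
        work.put_nowait(request)
        return True
    except queue.Full:
        return False  # caller should surface a 503 / retry-after upstream

for i in range(150):
    accept(i)

print(work.qsize())  # capped at 100; the extra 50 were shed, not queued
```

The shed requests fail fast at the edge, so each downstream tier scales to real demand rather than to an inflated backlog, which is the bullwhip amplification this prevents.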
Dynamically scaling down based on CPU consumption is the wrong way to do it, IMO. If your site is decently sized, you have a pretty typical diurnal pattern with weekly cyclical variation; that's your baseline.
> For another thing, how about knowing approximately what the traffic is supposed to look like for that time of day, on that day of the week and/or year? Then, don't down-scale past that by default?
Ran a few ad pixel servers. CPU was fine for that. Diurnal cycle, weekly cycle, handled holidays, handled sudden spikes. Pretty trivial ASG on AWS globally. 2 million rps.
But if your service was down for longer than it takes to downscale to the minimum, scaling back up is not that big of an issue: it was down anyway. Also, 24/7 instances exist for a reason; autoscaling is there to handle spikes, not normal traffic.
Pay attention, and don't confuse intention with effect.
What she's saying is that if you configure scaling such that it'll scale down when demand is unusually low, and then demand returns, the spike may be a difficult one to handle, particularly if your services depend on each other but each scales only based on its own history.
If A needs B needs C, and demand suddenly returns to A, does that cause C to scale up? Or will A scale up first, while C stays low for another half-hour because it recently scaled down?
Having C stay under-provisioned for half an hour after an outage ends wasn't anyone's intention when the autoscaling was configured. But as I wrote, don't confuse intention with effect.
That just means you should scale based on the work to be done rather than poor proxies such as CPU utilization. Also set a reasonable minimum and maximum based on observed load in production and review this as part of regular operational reviews.
Good edge case to consider when designing an autoscaling service, but now that I'm aware of it, I think I can design around the problem with some combination of the suggested solutions and still get the autoscaling that the article seemed to be warning me away from...
I don’t think the article was saying not to auto-scale as much as realizing it’s less of an edge case than it might sound and being careful not to underestimate the level of effort or overestimate the savings. That rang true to me — I’ve seen a lot of people realize the staff time they spent ended up pushing the time to recoup many years into the future. This is especially common if they’re inspired by a big tech company’s cool blog post or talk describing something amortized across a much larger volume.
If scaling up is painful there is something wrong with the architecture. Aside from this scenario, what if you just get a spike in traffic? If your scaling solution can't handle it, get a better one, otherwise what's the point?
Presumably servers take time to boot and initialize, which is still a problem if you get a spike but those aren't as sudden as "everything just turned back on".
Yup, in a reasonably (but not entirely) optimised setup, the spin-up time for a new node, from the scale-up event launching it to it being able to serve traffic, may be 2-3 minutes. And trust me, after a couple of mishaps with very aggressive scale-ups, you won't let your system launch the full demand-absorbing capacity all at once.
In a fully optimised setup each service image is itself 100% preconfigured, and only provisions node secrets during the boot. Even one of these types of nodes takes easily 30-40 seconds from launch event to actually serve traffic: it may join the load balancer just 25 seconds in, but the load balancer will want to see at least two good health checks before allowing any traffic to it.
The problem with aggressive upscaling in the depicted scenario is that your plumbing layer is also likely scaled down. Hitting it with a cascade of new nodes has the risk of going all thundering herd, crippling the system for both existing and new nodes.