
I eat words

@saint@group.lt

Posts
198
Comments
63
Joined
4 yr. ago

Linuxoid

Matrix - @saint:group.lt

  • heh, like other models are safe and reliable ;-)

  • no, no, and no, but either way you will have to find an answer to whether your decision to have kids or not was the right one.

  • It would be interesting to read what actually happened there and how it was handled, but a Cloudflare-level post-mortem analysis is probably too much to hope for.

  • Sysadmins for sysadmins @group.lt

    India Is Building an Open-Source Cloud Computing Effort

    spectrum.ieee.org /cloud-computing-in-india
  • Sysadmins for sysadmins @group.lt

    Improving platform resilience at Cloudflare through automation

    blog.cloudflare.com /improving-platform-resilience-at-cloudflare/
  • Sysadmins for sysadmins @group.lt

    Amazon.com, Amazon App, AWS Outage Reported Across U.S. on Friday | Frequent Business Traveler

    www.frequentbusinesstraveler.com /2024/10/amazon-com-amazon-app-aws-outage-reported-across-u-s-on-friday/
  • They cut all such scenes and pasted them into The Boys, Mark Twain style: "Sprinkle these around as you see fit!"

  • Biology @mander.xyz

    Most Life on Earth Is Dormant, After Pulling an ‘Emergency Brake’ | Quanta Magazine

    www.quantamagazine.org /most-life-on-earth-is-dormant-after-pulling-an-emergency-brake-20240605/
  • Sysadmins for sysadmins @group.lt

    Finnish Startup Wants to Build 100x Faster CPUs

    spectrum.ieee.org /parallel-processing-unit
  • Science @lemmy.ml

    Doom scrolling - Works in Progress

    worksinprogress.co /issue/doom-scrolling/
  • Science @lemmy.ml

    The Physics of Cold Water May Have Jump-Started Complex Life | Quanta Magazine

    www.quantamagazine.org /the-physics-of-cold-water-may-have-jump-started-complex-life-20240724/
  • Science @lemmy.ml

    With ‘Digital Twins,’ The Doctor Will See You Now | Quanta Magazine

    www.quantamagazine.org /with-digital-twins-the-doctor-will-see-you-now-20240726/
  • Science @beehaw.org

    The S-Matrix Is the Oracle Physicists Turn To in Times of Crisis | Quanta Magazine

    www.quantamagazine.org /the-s-matrix-is-the-oracle-physicists-turn-to-in-times-of-crisis-20240523/
  • Sysadmins for sysadmins @group.lt

    Enable build system on macOS hosts - Daniel Gomez via B4 Relay

    lore.kernel.org /dri-devel/20240906-macos-build-support-v2-0-06beff418848@samsung.com/
  • Science @lemmy.ml

    Across a Continent, Trees Sync Their Fruiting to the Sun | Quanta Magazine

    www.quantamagazine.org /across-a-continent-trees-sync-their-fruiting-to-the-sun-20240618/
  • Sysadmins for sysadmins @group.lt

    UCLA's Leonard Kleinrock on packet switching, early Internet

  • Sysadmins for sysadmins @group.lt

    How We Built the Internet

    every.to /p/how-we-built-the-internet
  • no

  • Sysadmins for sysadmins @group.lt

    Incantations

    josvisser.substack.com /p/incantations
  • Science @beehaw.org

    Redefining the scientific method: as the use of sophisticated scientific methods that extend our mind

    academic.oup.com /pnasnexus/article/3/4/pgae112/7626940
  • Reread today again, with some highlights:

    Lessons Learned from Twenty Years of Site Reliability Engineering


    Highlights

    The riskiness of a mitigation should scale with the severity of the outage

    We, here in SRE, have had some interesting experiences in choosing a mitigation with more risks than the outage it's meant to resolve.

    We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

    Recovery mechanisms should be fully tested before an emergency

    An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time.

    Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we've doubled down on testing.

    We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure.

    A "Big Red Button" is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever's happening.

    Unit tests alone are not enough - integration testing is also needed

    This lesson was learned during a Calendar outage in which our testing didn't follow the same path as real use, resulting in plenty of testing... that didn't help us assess how a change would perform in reality.

    Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services... relying on these Google services was, in retrospect, kind of a bad call.

    It's easy to think of availability as either "fully up" or "fully down" ... but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience.

    This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.

    A useful activity can also be sitting your team down and working through how some of these scenarios could theoretically play out—tabletop game style. This can also be a fun opportunity to explore those terrifying "What Ifs", for example, "What if part of your network connectivity gets shut down unexpectedly?".

    In such instances, you can reduce your mean time to resolution (MTTR), by automating mitigating measures done by hand. If there's a clear signal that a particular failure is occurring, then why can't that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.

    Having long delays between rollouts, especially in complex, multiple component systems, makes it extremely difficult to reason out the safety of a particular change. Frequent rollouts—with the proper testing in place— lead to fewer surprises from this class of failure.

    Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.

    Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.
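The highlight about automating hand-done mitigations can be sketched as a rule engine: if a clear signal fires, kick off a pre-tested mitigation first and root-cause afterwards. A minimal sketch, with hypothetical names and a plain dict of metrics standing in for a real alerting pipeline:

```python
# Sketch of signal-driven automated mitigation (hypothetical names; a real
# system would hook into an alerting pipeline rather than inspect a dict).
from dataclasses import dataclass
from typing import Callable

@dataclass
class MitigationRule:
    name: str
    is_firing: Callable[[dict], bool]   # clear signal that a failure is occurring
    mitigate: Callable[[], str]         # pre-tested, low-risk automated action

def run_rules(metrics: dict, rules: list[MitigationRule]) -> list[str]:
    """Kick off matching mitigations automatically; save root-causing for later."""
    actions = []
    for rule in rules:
        if rule.is_firing(metrics):
            actions.append(rule.mitigate())
    return actions

# Example rule: drain a region when its error rate crosses a threshold.
rules = [
    MitigationRule(
        name="drain-region",
        is_firing=lambda m: m.get("error_rate", 0.0) > 0.05,
        mitigate=lambda: "traffic drained from affected region",
    )
]

print(run_rules({"error_rate": 0.12}, rules))
```

The point from the article survives even in this toy form: the riskiness of the automated action should match the severity it responds to, and the mitigation itself must be tested before the emergency, not during it.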

  • This is what you get when you are not sleeping during biology classes.

  • the source code of a game ;))

  • i am all for normalizing raiding embassies for [put the cause you support] as well

  • woah, so nothing is sacred now? 😱🤔😐

  • looks interesting, but not this one.

  • can do, if you could provide the link to the debunking source - would be great!

  • nice, thank you.

  • a lot of things are possible if you are lucky enough ;)

  • well, this is probably PR, as no such system exists, nor can one be built, that has 100% uptime. Not to mention that network engineers rarely work with servers :)