All our servers and company laptops went down at pretty much the same time. Laptops have been bootlooping to blue screen of death. It’s all very exciting, personally, as someone not responsible for fixing it.

Apparently caused by a bad CrowdStrike update.

Edit: now being told we (who almost all generally work from home) need to come into the office Monday as they can only apply the fix in-person. We’ll see if that changes over the weekend…

  • Monument@lemmy.sdf.org · 4 months ago

    Honestly kind of excited for the company blogs to start spitting out their disaster recovery stories.

    I mean - this is just a giant test of disaster recovery plans. And while there are absolutely real-world consequences to this, the fix almost seems scriptable.

    If a company uses IPMI (called AMT and sometimes vPro by Intel), and their network is intact/the devices are on their network, they ought to be able to remotely address this.
    But that’s obviously predicated on them having already deployed/configured the tools.
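
    Something along these lines is all it would take once the out-of-band side exists. A rough PowerShell sketch using the built-in PcsvDevice cmdlets against an IPMI-reachable machine; the address and credentials are placeholders, and it assumes the management controller is already provisioned and on the network:

      # Placeholder address/credentials; assumes an already-provisioned,
      # reachable IPMI/BMC management controller.
      $cred = Get-Credential
      # Query the machine out-of-band (works even if the OS won't boot)
      Get-PcsvDevice -TargetAddress 192.0.2.10 -ManagementProtocol IPMI -Credential $cred
      # Power-cycle it so it can be caught at boot (recovery environment, PXE, etc.)
      Restart-PcsvDevice -TargetAddress 192.0.2.10 -ManagementProtocol IPMI -Credential $cred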

    • corsicanguppy@lemmy.ca · 4 months ago

      IPMI (called AMT and sometimes vPro by Intel),

      IPMI is not AMT. AMT/vPro is a closed protocol, right? Also, people are disabling AMT because of its documented risks, which is too bad; but disabling it is easier than properly firewalling it.

      Better to just say “it lets you bring up the console remotely without Windows running, so machines can be fixed by people who don’t have to come into the office”.

      • Monument@lemmy.sdf.org · 4 months ago

        Ah, you’re right. A poor turn of phrase.

        I meant to say that Intel brands its equivalent remote-management tools as AMT or vPro. (And completely sidestepped mentioning the numerous issues with AMT, because, well, that’s probably a novel at this point.)

    • v9CYKjLeia10dZpz88iU@programming.dev · 4 months ago

      I mean - this is just a giant test of disaster recovery plans. And while there are absolutely real-world consequences to this, the fix almost seems scriptable.

      It seems like it is. I’m not responsible for any computers that had this issue, but I saw a PowerShell script posted on reddit that could be pushed out through a group policy.
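
      I haven’t run it, but the core of those scripts was just removing the bad channel file once a machine is up far enough to run anything. A rough sketch of the idea, not the actual reddit script (the path and file pattern come from CrowdStrike’s guidance; everything else is my guess):

        # Sketch only, not the actual script from reddit.
        # Removes the bad CrowdStrike channel file if it's present.
        $dir = 'C:\Windows\System32\drivers\CrowdStrike'
        Get-ChildItem -Path $dir -Filter 'C-00000291*.sys' -ErrorAction SilentlyContinue |
            Remove-Item -Force -Verbose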

      Though I think some systems had more unusual problems; I also saw different steps for repairing an Azure VM.

      There were also people who didn’t understand how to get around BitLocker, and people on reddit posted solutions for that too.
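
      The usual answer for BitLocker was pulling the recovery password from wherever it’s escrowed. A sketch of the Active Directory version, assuming keys actually get backed up to AD and with ‘LAPTOP-042’ standing in for the real computer name:

        # Sketch: look up the BitLocker recovery password escrowed in AD for a machine.
        # Assumes the RSAT ActiveDirectory module and that your org escrows keys to AD.
        Import-Module ActiveDirectory
        $computer = Get-ADComputer 'LAPTOP-042'   # placeholder computer name
        Get-ADObject -SearchBase $computer.DistinguishedName `
            -Filter 'objectClass -eq "msFVE-RecoveryInformation"' `
            -Properties msFVE-RecoveryPassword |
            Select-Object Name, msFVE-RecoveryPassword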


      Though, even with all of this, I was surprised that hospitals had issues. It seems like there are other issues in those deployments, and I saw some people on YC claim this was related to organizations filling checkboxes for regulatory requirements: that they likely had this software because they were concerned with failing an audit. I don’t know if there’s truth to that, but I am surprised there wasn’t more redundancy in critical infrastructure.

      edit: I want to stress again that I’m not responsible for any computers that had this issue and haven’t tried any of the above solutions myself. I’ve just noticed lots of people still commenting on reddit who don’t realize they can fix this issue with one of these three approaches.

    • Saik0@lemmy.saik0.com · 4 months ago

      I mean - this is just a giant test of disaster recovery plans.

      Anyone who starts DR operations due to this did 0 research into the issue. For those running into the news here…

      CrowdStrike Blue Screen solution

      The CrowdStrike blue screen of death error occurred after an update. The CrowdStrike team recommends one of the following methods to fix the error and restore your Windows computer to normal usage:

      1. Rename the CrowdStrike folder
      2. Delete the “C-00000291*.sys” file in the CrowdStrike directory
      3. Disable the CSAgent service using the Registry Editor (see the sketch at the end of this comment)

      No need to roll full backups… as they’ll likely just try to update again anyway and BSOD again. Caching servers are a bitch…
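
      For the registry method (3), one way to do it when the machine won’t boot is to load the offline SYSTEM hive from the recovery environment and disable the driver there. Rough sketch only, shown PowerShell-style; the drive letter, hive alias, and the value to restore afterwards are assumptions on my part:

        # Run from the recovery environment; C: is assumed to be the Windows volume.
        reg load HKLM\offline C:\Windows\System32\config\SYSTEM
        # Start=4 disables the CSAgent driver so the machine can boot.
        reg add "HKLM\offline\ControlSet001\Services\CSAgent" /v Start /t REG_DWORD /d 4 /f
        reg unload HKLM\offline
        # Set Start back to its previous value (commonly 1) once a fixed channel file is in place.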

      • jj4211@lemmy.world · 4 months ago

        Note this is easy enough to do if systems are booting or you’re dealing with a handful, but if you have hundreds of poorly managed systems, it’s discard and do again.

        • Saik0@lemmy.saik0.com · 4 months ago

          Yeah, I can only imagine trying to walk someone through an offsite system that got BitLockered, because you need to get into safe mode. Reimaging from scratch might just be a faster process, assuming your infrastructure is set up to do it automatically over the network.

      • StaySquared@lemmy.world · 4 months ago

        Nah… just boot into safe mode > cmd prompt:

          cd C:\Windows\System32\drivers\CrowdStrike
          del C-00000291*.sys

        Then exit and reboot.

      • Monument@lemmy.sdf.org · 4 months ago

        I think we’re defining disaster differently. This is a disaster. It’s just not one that necessitates restoring from backup.

        Disaster recovery is about the plan(s), not necessarily specific actions. I would hope that companies recognize rerolling the server from backup isn’t the only option for every possible problem.
        I imagine CrowdStrike pulled the update, but it would be a nightmare of epic dumbness if organizations got trapped in an update loop.

        • Saik0@lemmy.saik0.com · 4 months ago

          I think we’re defining disaster differently. This is a disaster.

          I’ve not read a single DR document that says “research potential options”. DR stuff tends to go into play AFTER you’ve done the research that shows the system is unrecoverable. You shouldn’t be rolling DR plans in this case at all, as it’s recoverable.

          I imagine CrowdStrike pulled the update

          I also would imagine that they’d test updates before rolling them out. But we’re here… I honestly don’t know though. None of the systems under my control use it.

          • Skimflux@lemmy.world · 4 months ago

            Right, “research potential options” is usually part of crisis management, which should precede any application of the DR procedures.

            But there’s a wide range in the scope of those procedures; they might go from switching to secondary servers to a full rebuild from data backups on tape. In some cases they might be the best option even if the system is easily recoverable (e.g. if the DR procedure is faster than the repair).

            Just the ‘figuring out what the hell is going on’ phase can take several hours; if you can get the DR system up in less time than that, it’s certainly a good idea to roll it out. And if it turns out that you can fix the main system with a couple of lines of code, that’s great, but no one should be chastised for switching the DR system on to keep the business going while the main machines are borked.

            • Monument@lemmy.sdf.org · 4 months ago

              That’s a really astute observation - I threw out disaster recovery when I probably ought to have used crisis management instead. Imprecise on my part.

          • Monument@lemmy.sdf.org · 4 months ago

            The other commenter on this pointed out that I should have said crisis management rather than disaster recovery, and they’re right - and so were you, but I wasn’t thinking about that this morning.

            • Saik0@lemmy.saik0.com · 4 months ago

              Nah, it’s fair enough. I’m not trying to start an argument about any of this. But ya gotta talk in the terms that the insurance people talk in (because that’s how your C-suite understands it). If you say DR… and didn’t actually do DR… that can cause some auditing problems later. I unfortunately (or fortunately… I dunno) hold a C-suite position in a few companies. DR is a nasty word. Just like “security incident” is a VERY nasty phrase.

      • Nine@lemmy.world · 4 months ago

        Depends on your management solutions. Intel vPro can allow remote access like that on desktops & laptops even if they’re on WiFi and in some cases cellular. It’s gotta be provisioned first though.

        • catloaf@lemm.ee · 4 months ago

          Yeah, and no company in my experience bothers provisioning it. The cost of configuring and maintaining it exceeds the cost of handling failure events, even on large scales like this.