CrowdStrike's Incident Report

Episode 299 –

Description: This week on the episode, we walk through CrowdStrike's preliminary post-incident report to understand exactly what happened during the July 19th outage and what all software vendors can learn from the event. After that, we cover a clever plot that led to KnowBe4 hiring a North Korean threat actor. We end with some research from Wiz on artificial intelligence tenant isolation.

Transcript

Marc Laliberte  0:00  
Everyone, welcome back to The 443 Security Simplified. I'm your host, Marc Laliberte, and joining me today is

Corey Nachreiner  0:07  
Corey "kernel mode, sorry, I had a blue screen of death" Nachreiner.

Marc Laliberte  0:17  
Pretty good as far as hints go. On today's episode, we will go through CrowdStrike's post-incident report describing the IT security event that took down eight and a half million endpoints last Friday, or I guess two Fridays ago by the time you're listening to this. We'll then talk about how KnowBe4 accidentally hired a North Korean threat actor. And we will end with a quick research blog from our friends at Wiz. Wise? Wits? It's definitely Wiz. They do some research.

Corey Nachreiner  0:47  
It's not Cheez Whiz, by the way, however you pronounce it.

Marc Laliberte  0:52  
It was going to be Google Wiz, but that does not appear to be the case anymore. It's research into artificial intelligence that they've been doing, which they will be presenting at BlackHat. With that, let's go ahead and whiz our way in.

Corey Nachreiner  1:05  
That could go wrong. The Cheez Whiz way is a safer way.

Marc Laliberte  1:17  
So Corey, how was your Friday on the 19th?

Corey Nachreiner  1:22  
Yeah, doing a podcast without you, that was fun, thanks. But actually, while I might have been on the news talking about this, you lived it firsthand, I hear. So both our Fridays were fun, right? Yeah.

Marc Laliberte  1:36  
I was in Seattle on vacation that week before, and Friday was my day to try and get back home to Austin, Texas. And what started in the morning as "oh man, looks like they're about to have a bad day" turned into "oh man, I'm about to have a bad day" as my flight was, luckily, only delayed a few hours. I think I was lucky in that there was this special time period on the morning of the incident we're about to talk about where Delta still had all their pilots in the right places, they had all their airplanes in the right places, so that first round of flights made it out. And then the second round of flights did kind of okay, and then it all just went off the rails. By the time I came back to Seattle on that following Sunday, I waited, like, seven hours, on one of only 10% of Delta's flights at Austin that actually took off. Yeah. Holy crap.

Corey Nachreiner  2:34  
It turns out if you lose almost all your computers, and each one needs to be hand-touched to recover, you have a bad day as an organization, one that might get cascadingly worse as time goes on.

Marc Laliberte  2:44  
And it turns out, even if you bring the critical systems back online relatively quickly, if only like 5% of your workforce has working workstations, it becomes really difficult to use those systems effectively. So man, last Friday, which you already talked about on the last podcast without me, and again all over the local news here in Seattle, last Friday was nuts in the world of information technology.

Corey Nachreiner  3:10  
We should say we're talking about CrowdStrike. We literally have not said CrowdStrike yet. We are talking about CrowdStrike again, everyone. Yes.

Marc Laliberte  3:18  
And now the dust has settled a little bit, for some organizations anyway; the dust is still very much in the air for Delta, despite their CEO arriving in Paris this morning as we're recording. We have at least started to see some of the details from CrowdStrike on what the heck actually happened in this case. And to their credit, they released a very detailed and very transparent post-incident report. So far they're calling it their preliminary one. It's got some high-level but still somewhat detailed information about what happened to cause eight and a half million Windows computers to blue screen and require manual intervention to recover. So because this is the biggest IT incident in the last year, a couple of years even, I think it's worth going through and seeing what all of us can learn from what CrowdStrike has disclosed as part of this incident. So, quick background again: on Friday at around four in the morning, and lasting for about an hour and a half, any Windows machine that was running CrowdStrike sensor version 7.11 and was connected to a network downloaded a content configuration update that ultimately triggered a Windows blue screen of death system crash. And like we were chatting about at the start of this, Corey, when I got to SeaTac Airport, even the ticket printing machines that Delta uses, the little check-in kiosks, all of them were showing the blue screen of death or a recovery screen. All of the television monitors throughout the airport were showing it, just like the pictures you're showing on the video right now. It was honestly surreal, just the immediately obvious impact of this incident across the world.

Corey Nachreiner  5:09  
And that's one airline. It turned out lots of relatively large organizations were using this product. As you mentioned, the number seems to be 8.5 million endpoints out there. And I think one of the things I may not have done well on the news, because I talked about endpoints like computers or servers, but I think I might have done better in the podcast, is that people don't realize how many kiosk-based machines are out there. Like your baggage check, even if you've checked in online, the baggage ticket thing at the airline, your train monitor telling you the schedule, an ATM machine. There are many devices that modern EPDR or EDR solutions, which is what CrowdStrike makes, and what we make here at WatchGuard, are meant to be on: any device that supports the operating system, Windows, Linux, or Mac, because they all need protection. ATMs are being attacked if they're running Windows, etc., etc. So for the companies that did use this, you don't really realize the full fallout of something like this. It's not just your employees' workstations, as you mentioned, although those are a big deal on their own, and your servers; it's all these kiosk machines that are the modern conveniences, that are operational technology. I guess technical people realize they're computers, but a lot of people might think they're purpose-built machines, when they're just running Windows and apparently other things. Yeah.

Marc Laliberte  6:33  
And it was surreal just immediately seeing how many Windows machines had CrowdStrike installed and were connected to the internet to get their updates. It was nuts. So before we go through CrowdStrike's analysis of what happened, I think there's a few things we probably want to explain for folks that aren't Windows developers, or nerds like us that keep up to date on things. So real quick: in Windows, and I guess in every operating system, there's a separation between what's called user space and kernel space. Basically, programs that run in user space are given virtual memory allocations, and they aren't able to directly interact with the memory of other processes except through API functions that are exposed either by the operating system itself or by a driver. Drivers operate in kernel mode and can see and interact with memory from any process on the entire system. Device drivers, the things that interact directly with hardware, need to run in kernel mode in order to access the hardware itself. So for example, the driver for your storage disk, the driver for your optical disk if you've still got a DVD player, all of these need to interact with hardware, so they run in kernel mode. When a user space program crashes, the program itself just crashes and exits. You've probably encountered this several times throughout the year with random browser crashes or random applications that hang. At the end of the day, you lose whatever work was unsaved in that application, but the operating system chugs along just fine. When kernel space programs crash, the entire operating system effectively halts, and on Windows you get the blue screen of death. On Linux you get the black screen of death, and on Mac I think it's like a magenta one or something. But it's important, because an issue in kernel space that could potentially corrupt memory could affect the memory of any program, or every program, running on the system, including the underlying OS. That is why a crash in kernel space causes that blue screen of death and halts all execution. Right, Corey?
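As a rough illustration of the user space isolation Marc describes, here is a minimal Python sketch. It is only an analogy; a kernel-mode crash has no safe user-space equivalent you can demo, and nothing here comes from CrowdStrike's report.

```python
import subprocess
import sys

# Run a throwaway "application" that dies with an unhandled exception,
# standing in for a crashing user space program.
crashing_app = "raise RuntimeError('unsaved work is gone, but only in this process')"
result = subprocess.run([sys.executable, "-c", crashing_app])

# The child process crashed (non-zero exit code), but this process, and the
# operating system underneath both of them, keep chugging along just fine.
# A comparable fault in kernel mode would instead halt the whole machine.
print(f"child exited with code {result.returncode}; everything else is unaffected")
```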

Corey Nachreiner  8:42  
I agree, and I will get to why you're talking about this. I will say, by the way, most programs shouldn't really have one. There are a lot of people out there that think, other than OS makers and various specialized hardware makers, like graphics driver makers and stuff, normal programs should not have a kernel driver. You shouldn't be mucking around there. But I will say AV, anti-malware, EDR, whatever you want to call host-based security, host-based security is one of the things that needs an exception. A lot of attacks are happening, specifically rootkits, meant to do things at a kernel level instead of a user level to evade certain detections. So being able to load security-type things as a kernel driver is pretty important. You might hear programmers talk about how, for security, operating system stability, and uptime, you really should avoid kernel space when possible, and there's a reason they have that separation of privilege. But I will say host security software having a kernel driver is not unusual. I will get to it, Marc. I mean, I know why you're talking about this, but this issue wasn't in the kernel driver. I think we have theories about how the binary file might have been loaded by a kernel driver, but those are actually still being argued online, because according to the blog post we're talking about, the .sys file in question is not a kernel driver. But I assume you're talking about this mostly because, you know, they talk about what update caused the issue, which we're going to get into, and we suspect it might have to do with how that particular update is loaded by a kernel driver, right?

Marc Laliberte  10:27  
We'll get to that. They actually describe exactly how it interacted with their kernel driver. So to your point, though, endpoint security products need to see all memory on a system to look for rootkits, or just malicious programs, or even legitimate programs that have been compromised and are behaving unexpectedly. On Windows, you can do that through a kernel driver. On other operating systems like macOS, you actually interact through APIs that macOS exposes; macOS does not allow third-party kernel drivers. And that is a security trade-off, in that if whatever you're trying to monitor isn't accessible through that API, it is hidden from the endpoint protection.

Corey Nachreiner  11:06  
So it's not all good or all bad, it's up for debate. I forget who my favorite Mac researcher is, he talks at every DEF CON, but the good thing is they at least are starting to expose some of it via APIs, so they're giving a more secure way to look at it. But the guy I'm thinking of complains all the time that there are certain things their API is not allowing access to, and it is limiting security programs, it's limiting detection of some of the hacks he finds. So, you know, they're doing it for a secure reason, but then they're also obfuscating external security software from being able to help you.

Marc Laliberte  11:41  
Yep, so on Windows there's an entire program, they call it the Windows Hardware Quality Labs or WHQL, that's designed to let driver developers test their drivers, put them through certifications, and make sure they basically don't break anything on Windows. And then at the end of it, Microsoft cryptographically signs, basically, their stamp of approval on that driver to say this is certified to work with X version of the Windows operating system. Now that program takes time to go through. You have to do the tests, you have to submit it for signing. So it's not something you can really do on the rapid, frequent basis that you might need for endpoint protection, because any change to the driver code itself, any modification, has to go through that certification process again, be tested, be certified, be signed, and then you can deploy it. With endpoint security, you know, evolving threats change every day, and you need to stay up to date on whatever the latest techniques and tactics threat actors are using. Being forced to wait potentially several weeks to go through the certification process would make it extremely difficult to handle evolving threats. So vendors handle this in different ways. I'd say WatchGuard handles it differently than CrowdStrike, CrowdStrike handles it differently than Bitdefender or SentinelOne; everyone's got their own way to deliver fast updates. In this case, we'll talk about CrowdStrike, because that is where this issue arose. So a quick primer that CrowdStrike gave in their blog post is how they deliver content to their EDR tool. CrowdStrike uses packages that they call sensor content to identify adversarial activity. In their report, they said the sensor content packages can include things like a machine learning model or executable code that lets capabilities be reused by the more rapid response content that we'll talk about in a second. They talked about this concept of a template type, which is basically a predefined bit of code that they can instantiate into specific detections later on. The template type has predefined fields that let threat detection engineers write template instances to combat specific threats. So sensor content, including the template types, is part of the driver that goes through an extensive QA process that includes testing, fuzzing, manual and automated testing, validation, and a staged rollout. And that's available to CrowdStrike customers as either the latest release, N minus one, or N minus two.

Corey Nachreiner  14:24  
Can I put this sensor content in basically dummy speak? At least this type of update feels more like a full sensor update. Like, if you are getting a point release of the agent, whether you call it a sensor, an agent, whatever the application is, this is a normal full release. So this is the kind of thing that happens at best monthly, maybe even quarterly. And you'll talk about how this is fully tested. But to your point, what we're getting to is the other one, which, just think of them as the everyday, at least daily if not hourly, updates that you need to keep rules, behaviors, and detections current. It's a constant cat and mouse. So like Marc was saying, there are types of updates endpoint software needs that have to happen regularly. But this first one, if it were a full agent update, this is what would happen. This wasn't a full agent update. This was more like one of the daily or hourly updates, right?

Marc Laliberte  15:21  
Right, so sensor content, they call it that, is part of the driver that goes through all of the testing and validation and fuzzing and certification, and it's rolled out in a controllable manner by users on a, like you said, potentially monthly basis. CrowdStrike also releases what they call rapid response content, which is stored as just a binary file. It's proprietary, and it contains configuration data, not executable code. That's the sensor content packages. These are the configuration files that tell that executable code what to do in order to catch or evaluate or monitor a specific type of threat. In their blog post they specifically said that this rapid response content is not code, and it's not a kernel driver. It's delivered as what they call template instances, which are instantiations of a template type, and they map to specific behaviors for the sensor to observe, detect, block, whatever. They deliver this rapid response content regularly, sometimes multiple times a day, and it's delivered as .sys files, but like you pointed out, they are specifically not drivers, even though they use a .sys file extension.

Corey Nachreiner  16:41  
The reason we point this out is there's a lot of analysis online trying to smartly reverse, like, the memory condition that caused this. There's a well-known post, but then another post that shows a very different situation. A lot of people assumed that because it was a .sys file it was a kernel driver, and they wondered why half of it was zeroed out. It's because it's not a driver; it does contain instructions to do something, which Marc will get to. But there are a lot of people trying to technically reverse this, and I've already seen some competing answers.

Marc Laliberte  17:14  
Yep. So it's not executable code. It's a configuration file that tells an instance of that driver code what to do. So CrowdStrike delivers these rapid response content updates regularly, sometimes even multiple times a day. New template types, the executable driver code itself (remember, this is a template instance versus a template type), the template types are stress tested against, as they say, quote, any possible value of the associated data. Template instances, the rapid response content, however, only go through a content validator that performs validation checks on the content itself before it's published. So this is, I think, a good place to pause and get into exactly what they're saying here. Like the driver updates, the template types are dynamically tested. They obviously do some code checks themselves, they probably have a static application security testing tool, but they also put them through deployment stages and make sure they don't break on certain versions of Windows. For these configuration files, the template instances, they run just a static check against them to say: okay, for this field, is this value allowed? Yes? Okay, then continue on. And assuming everything is green, then they can go push it out into production.
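To make the difference concrete, here is a hedged sketch of what a purely static content check like that can and cannot do. The field names and allowed values are invented for illustration; CrowdStrike has not published their validator's actual schema.

```python
# Hypothetical schema: which values are acceptable for each config field.
ALLOWED_VALUES = {
    "template_type": {"ipc", "registry", "process"},
    "action": {"detect", "block", "monitor"},
    "severity": {"low", "medium", "high", "critical"},
}

def validate_instance(instance: dict) -> list:
    """Return a list of problems; an empty list means the static check passed."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = instance.get(field)
        if value not in allowed:
            problems.append(f"field {field!r} has unexpected value {value!r}")
    return problems

# A check like this only catches mistakes someone thought to encode in the
# schema. It never loads the content into a running sensor, which is exactly
# the gap the hosts dig into next.
print(validate_instance({"template_type": "ipc", "action": "block", "severity": "high"}))
```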

Corey Nachreiner  18:34  
And maybe they'll get into this in the timeline, but to, like, translate this the way you described it to me, this is kind of the crux of the problem, in that, you know, like us, for the full agent release they might have a full set of tests, internal beta, early adopter beta, and a staged rollout. For these kinds of updates, it seems like it's blasted to everyone if it passes this validation. So if that validation misses anything, it still goes to everyone. Yep.

Marc Laliberte  19:08  
So here's the timeline that they gave. In February of 2024, they released sensor version 7.11, which included a brand new template type, so executable driver code. They pushed it in their staging environment, and then pushed it out to production. This new template type was designed to detect attack techniques that abuse named pipes, so, like, inter-process communication. Throughout the blog they call it the IPC template type, so if you read it yourself, that's what they're talking about. So that was February 2024 for the new sensor version; they now have this new template type, executable code that they can use. On March 5, 2024, so the next month, they ran their stress test against that new template type in their staging environment, containing a bunch of different operating systems, and it passed that stress test. So this code was already in there, they just weren't actively using it. They didn't have any configuration files that instantiated it, but it was in there. And then they tested it a month later, and those tests came back: okay, we're ready to start using this. On March 5, that same day, after that successful stress test, they released their first template instance for the new template type. So basically, they tested the configuration file through their battery of tests, this configuration file and the brand new template type passed, so they released it to production. The next month, in April, they released two additional template type instances, or sorry, template instances, got to make sure we get the wording absolutely correct, on April 8 and April 24, for that new template type, and both of those worked without issues. Now at the end of April we've got the new code, the template type, and we have three configurations, the template instances, that are all working without problems. The first one went through the battery of stress tests, so they think that they're home and clear.

Corey Nachreiner  21:02  
And just to make sure I'm understanding right, and maybe the audience too: this testing is of these template types. But, you know, originally they weren't actually testing the rules that would go to those template types. When they got to the stress testing of this one, the channel file they talk about, is that actually the thing being tested? So at least...

Marc Laliberte  21:24  
Go ahead. That's the template instance. Their channel files are the configuration files.

Corey Nachreiner  21:29  
Yeah, so they're starting to test the types of things they're sending to the templates on an ongoing basis.

Marc Laliberte  21:36  
So now fast forward to July 19. They had two additional template instances that they wanted to deploy, so these are the configuration files that the code in the driver can execute against to look for things. Both of them passed their validation tests: they ran their static validation tool against them to make sure the values in these configuration files were okay, and both passed. But unfortunately, there was a bug in the validator, where technically one of those template instances should not have passed validation. And so here's where the meat of the issue comes in. They've got a quote here that said: based on the testing performed before the initial deployment of the template type in March, trust in the checks performed in the content validator, and previous successful IPC template instance deployments, these instances were deployed into production. So basically, because they had tested the code itself, like the driver bit, because they believed that their validation for these configuration files was okay, and because they hadn't had a problem yet, including in their stress test, they pushed them straight to production, where they became active. And, quoting again: when received by the sensor and loaded into the content interpreter, problematic content in channel file 291 resulted in an out-of-bounds memory read triggering an exception. The unexpected exception could not be gracefully handled, resulting in a Windows operating system crash, the blue screen of death. So this is interesting. It boils down to: there was a bug in their driver that this configuration triggered, their earlier testing didn't find that bug for whatever reason, and it was only once a kind of faulty config got pushed that it triggered the bug and caused the crash.
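As a toy analogy only, and not CrowdStrike's actual code, here is how "data, not code" can still crash the code that interprets it when that interpreter is missing a bounds check:

```python
def content_interpreter(template_instance, observed_event):
    # Bug: the interpreter blindly indexes into whatever fields the
    # configuration references, with no bounds checking.
    return [observed_event[i] for i in template_instance]

observed_event = ["pipe_name", "pid", "image_path"]   # only three values exist

good_instance = [0, 2]       # references fields that exist, works fine
bad_instance = [0, 2, 20]    # references a field that was never supplied

print(content_interpreter(good_instance, observed_event))

# The next call raises an unhandled IndexError, the user space stand-in for an
# out-of-bounds read. The same class of mistake in kernel mode halts Windows.
print(content_interpreter(bad_instance, observed_event))
```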

Corey Nachreiner  23:32  
And technically, this is a driver that's only getting updated, not by these rapid response things; that driver is probably part of their sensor content from February. Yep, yeah, it's just a driver bug that none of their other testing caught, and it wasn't until their rapid response, you know, this particular template instance update, hit it that the bug showed itself. But the weird thing is, the fact that their detection updates were triggering bugs in the main sensor product was the big deal, you know. The main product really needs QA testing at a much wider level, but the content updates are not going to get that level of testing. So you really need to somehow have good validation so that you know you're not triggering sensor bugs.

Marc Laliberte  24:26  
Yeah, it's interesting. Like, in theory, their stress testing should have found this. We don't know exactly what was in that content configuration file, the template instance. Like, was it a field that was left empty, and because it was left empty it triggered this bug in the driver? But their stress testing, if it was as thorough as they hoped, in theory should have caught this issue. So it's interesting, because the spirit of the Windows driver program is supposed to be that anything that could be a meaningful code change to a driver should go through validation again and get recertified. And that isn't really compatible with how endpoint security needs to work, with rapid iterations. So this is kind of skirting the spirit of that program, where these config files clearly can impact the driver and trigger a kernel-level issue.

Corey Nachreiner  25:18  
Question is, is it a sneaky way that they're literally changing the code of the driver? Or is the driver actually remaining unchanged, but it has a parsing error, so that when it parses whatever the content is, it runs into a bug that just wasn't caught? So I get that they could be bypassing the spirit of it. But I'm still not sure about that binary .sys file: however they're delivering it, they're obviously delivering behavior updates to a driver, but is the driver parsing it as a rule without adjusting its code, or has the driver code somehow been patched as part of the update? You know what I mean? I think for me, that makes the...

Marc Laliberte  25:59  
The former. It doesn't sound like they're updating the driver code. It sounds like it went to load this, basically, configuration file, and then just crapped out while trying to load the config.

Corey Nachreiner  26:09  
So to some extent, I feel like there may not be an issue there, because it's the driver itself that they shouldn't update without the process, and I think all drivers need to be able to parse content. So this could also be described as a very hard-to-find bug in the driver. Like, the driver not only passed their processes, it passed Microsoft's processes. And, by the way, don't forget, I'm sure they have learned from this and they'll update their testing process to find this no matter what; they should learn and update from this. I'm just saying, I think we've experienced enough bugs to know that sometimes you get things that aren't simple memory overflows and aren't simply triggered, you know, null pointers or off-by-ones. They're ones that happen in really unique situations that are hard to find. So I get where you're going with the potential that they're getting past the spirit of Microsoft's kernel driver testing; my personal opinion is I'm still open there. I just think they need to figure out how to test these more regular updates better, so that they don't trigger relatively hard-to-find bugs.

Marc Laliberte  27:20  
Let's talk about that testing real quick. I want to give an example of how our security team manages our SIEM. We use detection as code. Basically, all of our detection capabilities are managed in individual YAML-formatted files that include, like, the search string that we use, and some metadata about what should be included in that detection if it's triggered, things like that. During deployment, these individual YAML files are converted into just a monolithic configuration file for our SIEM, which we then load in, and that becomes the new detection suite. If there was a formatting issue in our YAML files, that could translate into an issue in that conf file that would break the whole thing, and break all of our detections in that file, if our SIEM couldn't load it and start working on it. So to make sure that doesn't happen, in our deployment pipeline we've got static validation for those YAML files as part of the build process that, for example, goes: okay, is the file itself valid YAML? Yes, okay, check number one passed. Now, for each field, are the values in that field matching up to what we expect for that field? Does it match a regular expression? Does it match a predefined set of, like, specific low, medium, high, or critical severity values and nothing else?

Corey Nachreiner  28:36  
So in other words, that's a form of bounds checking, right? If you're essentially giving input to a function that's doing something with it, you're at least making sure the input you're delivering is within the bounds of something you expect it to be, so that it's not something that would trigger whatever is parsing it to suddenly go crazy.
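For anyone who wants to picture that pipeline step, here is a hedged sketch of the kind of static YAML checks Marc is describing. The field names are illustrative, not WatchGuard's real detection schema, and it assumes the PyYAML package.

```python
import re
import yaml  # PyYAML

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}
NAME_PATTERN = re.compile(r"^[a-z0-9_]+$")

def validate_detection_file(path: str) -> list:
    """Static checks only: valid YAML, expected fields, allowed values."""
    try:
        with open(path) as handle:
            rule = yaml.safe_load(handle)   # check 1: is it even valid YAML?
    except yaml.YAMLError as err:
        return [f"{path}: not valid YAML ({err})"]

    if not isinstance(rule, dict):
        return [f"{path}: expected a mapping of fields"]

    problems = []
    if not NAME_PATTERN.match(str(rule.get("name", ""))):
        problems.append(f"{path}: 'name' must match {NAME_PATTERN.pattern}")
    if rule.get("severity") not in ALLOWED_SEVERITIES:
        problems.append(f"{path}: 'severity' must be one of {sorted(ALLOWED_SEVERITIES)}")
    if not rule.get("search"):
        problems.append(f"{path}: missing the 'search' string the SIEM will run")
    return problems
```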

Marc Laliberte  28:52  
Yeah, so it's a static check of the file itself, before we load it into the configuration file. But we don't want to rely just on those static checks, because by nature they're checking for scenarios that we expect to happen. Obviously we include scenarios that we think we know would break it if they were in there, but we don't know everything.

Corey Nachreiner  29:12  
That's actually a great point. So far we're kind of on par with, we're not above, their testing, because they did have a static, I forget what they call it, but a static check of this particular content file.

Marc Laliberte  29:27  
Just making sure I'm clear, again, this is our internal security team I'm talking about, and our, like, SIEM.

Corey Nachreiner  29:32  
or our product.

Marc Laliberte  29:35  
So, but because we don't trust that, or we don't trust that we aren't going to make a mistake in that YAML file that we didn't think of, one that would cause the whole thing to break, we don't just push that conf file directly to our production SIEM. We first install it in our staging SIEM, like our testing SIEM, and we make sure that it can load the conf file. Do all of our detections show up as we expect, and can we enable them? Basically, does anything catastrophically break? And only then do we deploy it to our production instance. That is the step that I think CrowdStrike was missing here, where even though their configuration validation tool returned okay, if they had loaded this into, you know, a Windows 10 or Windows 11 machine running sensor version 7.11, it sounds like, in theory, that machine would have immediately blue screened, and that would have been a red flag, and they would have known. Not pushing this to staging first, and pushing it straight to prod, is what caused this issue, ultimately.

Corey Nachreiner  30:31  
If they had just had it on one active machine before they pushed it to everyone, they would have seen it when it actually ran as a dynamic test. By the way, just because you were giving the example of how we check updates in our internal SIEM and our SOC, the audience out there probably wants to know a little bit about our endpoint process. You know, obviously this CrowdStrike thing didn't affect us at all, but I just wanted to point out, if you go to the WatchGuard product blog and our partner blog, Guillermo, basically our head of endpoint product management, released something that talks about both of our processes too. And by both I mean the actual full agent update process, which has a very rigorous amount of testing and staging, and pre-beta and early adopters, and then a staged rollout, but we also go into the content update process too. So we talk about, for EPDR and AD360, what types of testing we do when we're delivering the daily or hourly updates to our product. So if you are interested in how our endpoint handles updates, definitely go check out that blog post.
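The missing step the hosts keep coming back to is a dynamic one. A minimal sketch of that staged rollout logic, with hypothetical deploy_config and count_loaded_detections stand-ins for whatever API a given SIEM actually exposes, might look like this:

```python
def staged_rollout(conf_file, expected_detections, deploy_config, count_loaded_detections):
    """Load the compiled config into staging and smoke-test it before prod sees it."""
    deploy_config("staging", conf_file)
    loaded = count_loaded_detections("staging")
    if loaded != expected_detections:
        # Something broke only when the config was actually loaded, exactly the
        # kind of failure a static validator can miss. Stop here; prod is untouched.
        raise RuntimeError(
            f"staging loaded {loaded} detections, expected {expected_detections}"
        )
    deploy_config("production", conf_file)
```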

Marc Laliberte  31:45  
So at the end of the day, I think we can summarize this as an issue of relying on static code validation versus dynamic testing of a change. And CrowdStrike, to be clear, in their blog post they have already discussed that they learned their lesson, and they're going to change some of their deployment processes in the future to try and catch some of this. And to be clear, this could and will happen to every software vendor out there, maybe not to the extent that we saw on the 19th, but bugs happen, and testing sometimes doesn't catch everything.

Corey Nachreiner  32:19  
For sure, yeah. We've seen bugs as simple as a signature update from old AV vendors blue-screening Windows machines simply because they tried to quarantine a critical Windows file. Luckily, those cases just didn't cause as much global strife as this one. So this does seem like the largest instance I can remember of an AV-related bug that crashed Windows, you know, really affecting everyone, but there have been plenty of other anti-malware bugs that have crashed Windows.

Marc Laliberte  32:50  
Yep. And this is also, as a last point, an instance where being more secure ended up causing problems for organizations. Going back to Delta, who I have to hop on a flight with again tomorrow: one of the reasons they're struggling so hard to recover is because they followed good data security practices and deployed BitLocker disk encryption literally everywhere, even on their, like, user kiosks where you go print out your boarding pass. And unfortunately, because they were using BitLocker, you can't programmatically, remotely recover these machines. You have to physically get hands on the keyboard, or a virtual keyboard, boot into safe mode, enter the BitLocker recovery key, and then you can go in and delete that problematic .sys file, the configuration file, and recover.

Corey Nachreiner  33:36  
Then we can share the story we talked about this morning. Like, another security thing is you should be patching and keeping all your servers up to date. But what was the company that basically...

Marc Laliberte  33:48  
There was one I saw on Twitter that was pretty great, where they said they were saved from this because the first system that got the update was their DNS server, and it went down and broke DNS resolution for their whole organization, which meant none of the other systems could get their updates, and they were all saved.

Corey Nachreiner  34:05  
So by the time they fixed their DNS server, the other systems got the fixed updates. Yeah, sometimes crashing something else, or having an old product that's not affected, saves you. That is not an excuse for keeping old products, though.

Marc Laliberte  34:20  
No, not at all. But man, I think this is, at least, this isn't to throw shade at CrowdStrike; this is something all of us, any software manufacturer, can and should hopefully learn from, even though it caused so much pain.

Corey Nachreiner  34:34  
I would say especially security-related software manufacturers that have things that run in kernel mode. There are few instances where it's required, and if you're a software creator that doesn't need it, leave it alone. But if you absolutely need it, well, a network driver, or I'm sorry, a graphics driver, could have a bug like this too, so everyone in kernel mode should be learning from this. Yep.

Marc Laliberte  35:02  
So anyways, let's move on to the next story then. This one was pretty cool. I stumbled across this story on, like, r/cybersecurity. I think it was a post by KnowBe4, and the title immediately caught my attention, because it's all about how they unknowingly hired a threat actor from North Korea. They did preface the article with a little paragraph saying that no illegal access was gained, no data was lost, compromised, or exfiltrated, and that the whole point of the article is just to talk about an incident that they had, so we all can learn from what they learned along the way. So this all started when KnowBe4 wanted to hire a software engineer for their internal IT artificial intelligence team. They found a qualified individual, HR completed four video-conference-based interviews with them, confirmed that the individual's photo matched what they had on their application, and their background check and everything cleared. They later found out that this is because the individual had stolen a valid US identity from someone else, and they had modified a stock image with artificial intelligence to make it look professional. Either way, they passed all the HR checks, they hired this employee, set up their new macOS workstation, and shipped it out to them. But as soon as that Mac workstation came online, it immediately started trying to load malware. This all started at 9:55 PM Eastern on July 15, when KnowBe4's security operations team detected that anomalous activity. They called up the employee to say, hey, you know, what the heck's going on? The SOC already suspected that this was intentional and that they might be dealing with an insider threat. The employee responded by saying they were following steps on their router guide to troubleshoot a speed issue, which may have caused the compromise or sketchy activity. Along the way, the attacker was trying to manipulate session history files, transfer potentially harmful files to the machine, and execute unauthorized software. They were using a Raspberry Pi to download malware onto the device. The SOC tried to get more details from them and tried to get them on a phone call, but he said he was unavailable for a call and then later became unresponsive. The SOC ended up isolating the system about 30 minutes after this started, at 10:20 PM Eastern on the same day. They later found that this employee, this North Korea-based individual, had that laptop shipped to what's called an IT mule laptop farm, basically a physical location in the United States where they could then connect over VPN to access the device remotely and work. They even worked, like, night shifts in North Korea to make it look like they were in the United States, and it allows them to skirt around, I guess, international laws, and in this case do work and get paid well. And they give a bunch of that money back to the North Korean government to fund their illegal programs. This is pretty nuts, and honestly, this kind of scared the crap out of me when I saw it, because it's tough. How do you protect against this? Like, they passed four video interviews with HR, they passed the background check, everything checked out clean until the SOC caught them doing malicious stuff on their laptop. Now, KnowBe4, they had a couple of...

Corey Nachreiner  38:35  
Go ahead. I feel like it's very hard to not accidentally hire an employee that's going to that level of faking it for HR. I would hope, if I were KnowBe4, I would kind of go after the background check company though, because if some of their identity was fraudulent, you would assume the background check company could catch it. But other than that, that's hard. I do want to give kudos, because it's going to be impossible to find. Like, there could be legitimate US employees that have bad intentions, and they definitely could pass a background check. You know, maybe it's their first time deciding they want to install ransomware to get a little percentage from some ransomware author, but before that they had a clean work history; it's hard to catch those. But to me, the real cool part about this is how quickly the SOC responded. You know, the fact that they had a SOC, the fact that they had EDR software, the fact that they were monitoring the devices of their own trusted users so well, shows that they at least, as an organization, follow some of the best practices, including zero trust practices. I hate to say it, because I feel like we trust our employees at WatchGuard, and so far we're lucky with really good employees, but the whole idea of zero trust is to monitor your own people too and make sure they have limited privilege. So I feel like if it wasn't for all the great SOC stuff they're doing, which maybe some businesses aren't, this could have been a really bad situation. So how do you stop the bad insider from getting in? That's a question I don't even know how to answer without psychology and better background checks. But as far as how to catch it, I think this is a great use case for how you can catch this type of thing, after the fact at least.

Marc Laliberte  40:24  
They had some recommendations, like better vetting. There were some inconsistencies when they went and looked back, like birthdays that they provided in some areas versus other areas, and career inconsistencies as well. They recommended making sure that people are physically where they're supposed to be, and that if the laptop shipping address is different from where they live or work, that's a red flag. Look for things like VoIP phone numbers and a lack of digital footprint in their contact info, versus, like, a cell phone number that's allocated from T-Mobile or whatever in the US. And then, as a technical protection, make sure you don't allow remote access, like remoting into these remote devices.

Corey Nachreiner  41:06  
Yeah, I agree with all that. By the way, one last thing before we go off of it. What I was showing on my screen feels very similar to me, something you and I saw. This is an entirely different case, but it shows how big an issue insider threats are, and how it may not just be these guys that are faking identities. Marc and I saw, at the FBI CISO Academy, a video the FBI is trying to push out for people to see. It's this "Made in Beijing: The Plan for Global Market Domination." At the highest level, if you go check out this video, this is a case where an actual Chinese citizen from Beijing came to the US for schooling, went through schooling for, in this case, I think, genetics in agriculture, and got a job at a high-end agriculture company, you know, a place where they do GMO and different types of plants and stuff, and worked there for 20 years. They literally were just working as an employee, actually a key employee in some of the stuff that this company did. But meanwhile, for that entire 20 years, they were connected to the Chinese government, and their goal was to steal intellectual property, which they were sharing. So if you're curious about how insidious the insider threat can be, unlike this one, which was kind of identity-posing fraud mixed with a very quick catch, this is one where they actually go through the steps of developing a valid, working US identity, but they're still associated with the country that they originally started in decades ago, and they're literally a malicious insider. So I recommend it, for sure.

Marc Laliberte  42:54  
So moving on to the last story, and this is more of a preview of something I'm excited to see at BlackHat this year in just a couple of weeks. Researchers from Wiz published some of their findings last week on what they call tenant isolation research across multiple artificial intelligence service providers. I guess they're going to give a talk on it at BlackHat called "Isolation or Hallucination? Hacking AI Infrastructure Providers for Fun and Weights." But in the post this week, they published their vulnerability findings for SAP's artificial intelligence offering called SAP AI Core. The main takeaways were: they found vulnerabilities that could let them read and modify Docker images on SAP's internal container registry, modify Docker images on SAP's Google Container Registry, read and modify artifacts on SAP's internal Artifactory server for their development pipeline, gain cluster administrator privileges on SAP AI Core's Kubernetes clusters, and access customers' cloud credentials and private AI artifacts. I won't go through all the details on this; I'm definitely going to go to their talk in a couple of weeks, and we'll give some more details then. But at a high level, it was really interesting, the path they took to escape out of this system. So SAP lets customers run their artificial intelligence projects as Kubernetes pods, in containers, on a cluster that SAP manages. Those containers are restricted by default. In fact, network access controls are handled by something called Istio, which is like a service mesh extension for Kubernetes. Containers aren't allowed to run as root, for example. But the first thing they found was that you can have your container run as any other user ID on that system, and while you couldn't run as root, you could run as Istio's user ID, which then let them run as that user and bypass the network protection rules they had, which basically opened the door into the entire internal Kubernetes network. So they started scanning the pods' internal network. They found things like a Grafana Loki instance that exposed AWS authentication tokens. They found other customers' pods were exposing Elastic File System, EFS, shares, which allowed them to read sensitive files from those customers. They found the Helm package manager for this Kubernetes cluster was unauthenticated, which let them query and pull credentials out of that Helm server, including read and write credentials for their Docker registry and their Artifactory server. They found exposed secrets from SAP beyond just AI Core, things like their AWS credentials, data lake access, and Docker Hub credentials. So, from allowing customers to run their own containers in something you think you control, one mistake, one chink in that armor, allowed them to basically gain access and visibility into the entire system. It was pretty, pretty nuts.
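For readers who want to see the shape of that first foothold, here is a hedged reconstruction of the run-as-the-sidecar trick. Istio's sidecar proxy conventionally runs as UID 1337 and its iptables rules exempt that UID from traffic interception; the image name here is a placeholder, and treating 1337 as the relevant UID in SAP's environment is an assumption based on Istio's defaults, not something stated in the podcast.

```python
import yaml  # PyYAML, just to print a familiar-looking manifest

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "customer-training-job"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/customer/trainer:latest",  # placeholder
            # Not root, so a naive "no root" admission check still passes,
            # but traffic from this UID is exempted by the istio-init iptables
            # rules, so the pod reaches the internal network with no mesh
            # policy applied to it.
            "securityContext": {"runAsUser": 1337},
        }],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```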

Corey Nachreiner  45:54  
And as we try to figure out the security risks with these AI models, you know, one of the things we're asking people to look for is, if you pay for something, at least you get your own tenant. But the whole point of this is that AI is so nascent and evolving so quickly that we're building systems that have mistakes, where your own tenant may not be as secure as you think it is, essentially.

Marc Laliberte  46:18  
So this was one example; they apparently researched a ton of different organizations and AI providers. I'm very much looking forward to this talk at BlackHat this year, and we'll include it in one of our recap episodes. But man, like I said, I've got a flight home to Austin in about 24 hours as we're recording this, and I sure as heck hope that there isn't another catastrophic IT issue that causes that one to get delayed as well. What a crazy, crazy week this last week was.

Corey Nachreiner  46:47  
It was fun.

Marc Laliberte  46:50  
You and I have different definitions of fun.

Corey Nachreiner  46:54  
Or you just didn't catch my sarcasm on the fun.

Marc Laliberte  47:01  
Everyone, thanks again for listening. As always, if you enjoyed today's episode, don't forget to rate, review, and subscribe. If you have any questions on today's topics or suggestions for future episode topics, you can reach out to us on Instagram at WatchGuard underscore Technologies. Thanks again for listening, and you will hear from us next week. Why are you giving a thumbs down?

Corey Nachreiner  47:21  
I was saying yes, with the thumbs up, not thumbs down.

Marc Laliberte  47:24  
Okay, got it.

Corey Nachreiner  47:28  
You rate us how you want. But Marc will send a beer to everyone that gives us five stars.

Marc Laliberte  47:35  
I'm expensing those beers