Detailed Notes
What should you do when software fails?
In this episode of Building Better Developers with AI, Rob Broadhead and Michael Meloche explore what really happens when systems crash, production goes down, and disaster strikes. Originally inspired by the episode When Coffee Hits the Fan, we walk through real developer mistakes, recovery strategies, and the tools that can help you respond fast when failure hits.
✅ Key Topics Covered:
• Real-world developer outage stories
• Common mistakes that cause production failures
• How to build your own recovery strategy
• Tools like Docker, Terraform, GitHub Actions, and Chaos Monkey
• Blameless postmortems and communication during outages
• Why mindset matters when things go wrong
🛠️ Whether you’re solo or on a DevOps team, this episode will help you prepare, respond, and recover the next time your software fails.
⸻
🎧 Listen to the Podcast: https://develpreneur.com/what-happens-when-software-fails/
⸻
🔗 Links & Mentions:
• Docker: https://www.docker.com
• Gremlin (Chaos Engineering): https://www.gremlin.com
• GitHub Actions: https://github.com/features/actions
• Terraform: https://www.terraform.io
• Sentry: https://sentry.io
⸻
📌 Connect with Us:
• Website: https://develpreneur.com/
• LinkedIn: https://www.linkedin.com/company/develpreneur/
• Twitter / X: https://X.com/develpreneur
• Facebook: https://facebook.com/Develpreneur
⸻
#SoftwareFails #DevOps #DeveloperRecovery #DisasterRecovery #BuildingBetterDevelopers
Transcript Text
[Music] That was a recording in progress. All right, this was going to be fun. Uh, diving right in. We were just whining before this, so we're trying to adjust here and get back to being normal people and not the whining little pansies that we are. And yes, we were whining about work. So, welcome to our world, which is the same as your world, probably. So, uh, this time we're going to continue. This is the "When Coffee Hits the Fan: Real Talk on Developer Disaster Recovery." Um, this is actually interesting because I already threw it in there and the first thing that came back is "that's a brilliant and catchy title," which I think was generated by itself. So, it's like patting itself on the back while it's like, "Wow, that's awesome. I'm glad you sent me one of those or you used our title." All right. Well, let's dive into it and see how it goes with a little three, two, one. Well, hello and welcome back. We are continuing our season where it is with AI. Yes, we are Building Better Developers, the Develpreneur podcast. This season we are going back through a prior season, going through some of the episodes and basically shoving those into an AI engine, uh, specifically ChatGPT, and seeing what it gives us back as its recommendations just to sort of see how AI does stuff. Before that I need to talk about how I do stuff. Who am I? I am Rob Broadhead. I'm one of the founders of Develpreneur, also a founder of, actually the founder of, RB Consulting, where we are, you know, there's a lot of different ways to refer to it. Sometimes we're called a fractional CTO or CIO, sometimes we'd be referred to as boutique consulting. The bottom line is we sit down with our customers, we work through their business, talk about what they are doing, what their processes are, what their goals and their vision are, and then we look at what kind of technologies they have and what kinds of technologies are out there for them.
And we will use a whole bunch of different tools, whether it's simplification, integration, automation, innovation, any of those things and more, to help them craft essentially a recipe for success that is unique to that company, because everybody's got a little bit of a, you know, different take on stuff. You've got different staff, different needs, different resources. We help mirror those, or actually marry those things together, provide a technical road map, and then can help you implement it or can let you go on your merry way following that road map. Good thing, bad thing, uh, very much near and dear to my heart right now. Good thing is we are going through and updating the townhome that we just recently got. We got a whole bunch of little things too. So that's great. That's awesome. And I've got a wife that does that stuff. So she's off doing all those cool things. Bad thing is that sometimes she can't get some of those things done because it needs, as she calls it, somebody with muscles. And yes, I happen to have a couple. So, I had to actually come to the townhouse, and the really bad thing is we don't really have Wi-Fi here. So, I'm working off my phone. So, I may not have the best connectivity this time around, which is okay because at some point that means I'll just block have to see him. But you will if you're on the YouTube. But more importantly, I'll let him introduce himself and hopefully dig himself out of the hole that I just created for him. Apologies.
>> Hey everyone, my name is Michael Meloche. I'm one of the co-founders of Develpreneur. I'm also the founder and owner of Envision QA, where we help startups and growing companies launch better products faster. That means fewer bugs, smoother customer experience, and less wasted time and money. We take care of the behind-the-scenes quality work so teams can focus on building and scaling. Learn more at envisionqa.com.
We also offer other things like, uh, software assessments and technical assessments, and really we can help you understand what your software stack is or what technologies you have if you have no idea. So we can also help you help yourself and improve your products. Good thing, bad thing. Good thing: uh, had a nice restful weekend. Fourth of July was a lot of fun. Got to catch up with some old friends and, uh, see some people we haven't seen in a while. Weather was absolutely great. Uh, it rained a little bit, but I have to say that was probably the first week we've had less rain and more sun, so that was a lot of fun. Uh, bad side of that: we were at the river house and of course as we're going through all the checklists, uh, we had to get some repairs done as well. So we had to replace a couple toilets and, uh, the garbage disposal went out, and then of course some plumbing issues. So, uh, good with the bad.
>> Well, we have nothing but good ahead of us today. We're going to take the episode that was originally called "When Coffee Hits the Fan: Real Talk on Developer Disaster Recovery," and so we have thrown this into ChatGPT and it comes back with "that's a brilliant and catchy title" because it likes us. It was going, like, neutral for a little bit. It's come back to, I think it heard me talking about it. It's probably out there, like, sourcing data and it's like, ooh, we're being, you know, we're getting shade thrown at us. So, let's come back to this, like, little bubbly thing that it likes to do. So, let's dive right into this one. This one's interesting. So, they add a cold open to the episode structure. Cold open, one to two minutes. Describe a humorous or dramatic coffee spill scenario. For example, you just pushed to production. CI passed. You sip your coffee and then you see the site is down completely. Cue theme music. That would be a really dramatic entrance into our podcast that I have never actually done, I don't think. So, interesting idea.
Maybe in a future season we will adjust how we come into these things. The best we get is, like, the Christmas and Thanksgiving deal music that we get, and New Year's I guess, when we do those. Not a whole lot there. So, all right. We missed that. All right, we're going to go into act one, real developer disasters, 8 to 10 minutes. Let's see if we can keep it to that. Hook with real-world stories or invite a guest to share their worst oops moment. Now, this is interesting because this is actually giving me a very different, uh, answer than we've gotten in the past couple of times, well, actually now, you know, the dozen or so times that we've done this. So, hook with real-world stories or invite a guest, including: accidentally wiped out a production database, hot fix that broke everything, misconfigured cloud resources, uh, which equals $10,000 overnight, copy-pasted secret keys into a public repository, actually spilling coffee on a keyboard mid-deploy. It is amazing how many of those things, I think there's one maybe that I haven't had. Don't think I've actually spilled coffee on a keyboard mid-deploy. I don't know that I've ever actually, because I don't drink coffee, spilled it. Uh, I think I've spilled stuff on keyboards and lost keyboards, but not mid-deploy. So I have, like, one of the five that they give us that has not happened. Misconfigured cloud resources. I'm just going to jump on that one real quick because it's not necessarily a disaster per se, but it can be pretty darn close to it. And I think it's more common. I've seen it a lot of times. It's probably more common than some of these other, we'll call them, you know, in quotes, disasters.
And it really comes down to, and I hate to throw them under the bus, but you know, Amazon, Google, Microsoft, when you build with their cloud tools, when you build environments, they build them, and I'm going to give them, you know, a little bit of grace and forgiveness here, to be essentially an enterprise-level solution. The problem that we have a lot of times, particularly as side hustlers and developers and things like that, is we don't need something of that level. So for example, and this goes back to when we were doing our, uh, launch your internet business. That's it. Uh, we're doing that. One of the things we do is we use Amazon, we do an EC2 instance, and we do WordPress. Well, at one point I decided to use the Amazon tools to do it to make it easier. And it set up a WordPress environment, but it included, um, it went out and did Route 53 for SSL. It had a secure cert, which I didn't even need, but it was like, you know, that's 100 bucks right there. Um, it spun up like three different servers. Actually, there were two front-end web servers. There was one back-end database, and they were all connected to an RDS database, and then there was a load balancer, and then there were a couple of extra file servers out there just for media and stuff like that. Which, if you've got a nice big WordPress installation, awesome. The daily cost of that thing that it generated for me, I think, wasn't huge, but it was like 50 or 60 bucks, maybe a hundred bucks a day. If you're blogging and you're just starting out, you're not going to generate a hundred bucks a day to support it, and we were sitting there saying launch your internet business on, you know, basically less than 50 bucks. So, that would have completely blown it all up.
So, we immediately said, "No, that is not an option that we're going to suggest to people." And I don't want to throw Amazon under the bus because I have done the same thing with Azure and with Google. It's like, what they build makes sense, but it also doesn't often make sense for you if you're just, you know, playing around with it, doing some development, some testing and things like that. So, minor disasters, but they can cost you. And I have more than a couple times spent a little more money than I wanted to on, uh, places like cloud providers and things like that. I will now toss it to you because I've probably almost burned through all of our 8 to 10 minutes already.
>> Yeah. So on this one, I'm going to take the database one because so many times that has bitten me in the butt. It's like, okay, you're working and you may think you're in production or you may think you're not in production, you're on a different machine, depending upon how you interface with the database. If you're logging in through a command line interface, they all look the same unless you go do some custom coloring to make sure that the system you're connecting to is the color for what you're on. You know, green is good, yellow is warn, you know, like QA, and red is prod. Make sure you don't break anything. I'm just going to go kind of quickly through this one. So with this I will start with what you should be doing first. Time-wise you don't always have this, but always back up the system that you're working on. Make sure you have data backups or system backups. Uh, make sure that the tools that you're using have those color features that I was just mentioning. Also, if you are using tools, make sure you turn off autocommit for any environment other than dev. If you make a mistake, you can easily recover if autocommit is off.
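As an illustration of the color-coding habit described above, here is a minimal sketch in Python. The environment names (`dev`, `qa`, `prod`), the `env_banner` function, and the banner text are hypothetical and not something named in the episode; the ANSI escape codes are standard terminal background colors. Falling back to red for unknown names is one possible design choice, so that a typo is treated as production.

```python
# Hypothetical environment names and colors, following the
# green / yellow / red convention described in the episode.
ANSI_BACKGROUNDS = {
    "dev": "\033[42m",   # green background
    "qa": "\033[43m",    # yellow background
    "prod": "\033[41m",  # red background
}
RESET = "\033[0m"


def env_banner(env: str) -> str:
    """Return a colored banner line for the given environment name."""
    # Unknown names fall back to red so a typo is treated like production.
    color = ANSI_BACKGROUNDS.get(env.lower(), ANSI_BACKGROUNDS["prod"])
    return f"{color} CONNECTED TO: {env.upper()} {RESET}"


if __name__ == "__main__":
    for name in ("dev", "qa", "prod"):
        print(env_banner(name))
```

A wrapper script could print this banner before opening a database shell, so the color is the first thing you see.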
It may take longer, but it can save your butt so many times. The other thing is, uh, again, when you are making changes to production, make sure that you've tested these scripts out on lower environments first. Test them on a dev environment, test them on a QA environment, test them on more than one environment to make sure that they work the way you expect them to. Maybe write up some testing, depending upon what language you have, to make sure that, hey, you've got the data there. Uh, maybe write some, um, stored procedures to do some, uh, testing for you. Sometimes even having stored procedures doing SQL, uh, updates for you is a better idea because you can establish rules so that, if they fail, you can run everything as a single transaction and roll it all back. Uh, or once it's finished you can do some checks in there, and if those checks pass, cool. If those checks fail, it could also roll back. So you can kind of build in some disaster recovery with this. But I just have to say, more times than not you're going to be in a hurry. And all I have to say is, before you do a delete, an update, before you make a change, make sure you know what system you're on, because you can make a mistake and be hurting tomorrow. Other thing is, even if you don't have a system backup, back up the table that you're working on. So hopefully you at least have a snapshot and you can fix the data quickly. Be warned, triggers and things of that nature will make that harder. It's better to do a full database backup before you make changes.
>> A couple things I want to throw on that. Um, yes, you definitely want to do it in a, call it, lower-level database as opposed to production. Um, one of the things I was taught years ago as a DBA is that whatever you're going to do, if you're going to do an update, a delete, or anything like that that's going to change stuff, even inserts, uh, depending on how you're doing the insert, do it as a select first.
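The "back up the table, run the change as one transaction, check it, and roll back if the checks fail" idea could be sketched like this. It is a rough illustration using Python's built-in `sqlite3` module as a stand-in for whatever database you actually run; the function name `update_with_snapshot`, the `_backup` suffix, and the example table are assumptions made for the example, not anything from the episode.

```python
import sqlite3


def update_with_snapshot(conn, table, update_sql, params, check_sql):
    """Snapshot a table, apply an update, and roll back if a check fails.

    check_sql must return a single number: the count of rows that break
    the rule you care about (0 means the change looks fine).
    """
    # Table names are interpolated, so they must come from trusted code.
    snapshot = f"{table}_backup"
    conn.execute(f"DROP TABLE IF EXISTS {snapshot}")
    conn.execute(f"CREATE TABLE {snapshot} AS SELECT * FROM {table}")
    conn.commit()

    # sqlite3 opens a transaction implicitly before the UPDATE, so nothing
    # is permanent until commit() is called.
    conn.execute(update_sql, params)
    bad_rows = conn.execute(check_sql).fetchone()[0]
    if bad_rows:
        conn.rollback()
        return False
    conn.commit()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany("INSERT INTO products (price) VALUES (?)", [(10.0,), (20.0,)])
    conn.commit()
    applied = update_with_snapshot(
        conn,
        "products",
        "UPDATE products SET price = price - ?",
        (15.0,),
        "SELECT COUNT(*) FROM products WHERE price < 0",
    )
    print("change applied:", applied)
```

As the transcript warns, a plain table copy does not capture triggers or related tables, so this complements a full database backup rather than replacing one.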
I know this is a little geeky database stuff, but if you're going to do "delete blank from blank," do the select first and take a look at that and make sure that your row count looks like it should. Um, if you are typing stuff, always turn your autocommit off. That is a very important thing. Um, make sure if you're writing something that the first thing you write is your WHERE clause. I have found too many times that I've been doing something and then you accidentally, like, flub a key and then you have to, you know, hopefully you just have to roll it back. Uh, definitely use color coding wherever possible. Even if I'm using command line shells, I will use stuff that I have. Uh, granted, I'm a Mac person, but I know you can do this with other systems as well. You can change the background of the command line environment that you're in. So if you telnet into something, you can actually change that to make sure that you're in the proper shell. And as Michael said, typically, you know, green is development, yellow for test, and red for production. And please don't, you know, mess with that kind of stuff. Um, also I will just say, do not use root anywhere, ever, if you can possibly avoid it. I know there are some places you can't. Uh, in those cases don't be afraid to use things like Docker, and even AI, to generate essentially your own development environment. Even when you're dealing with production environments that are just too big to deal with, find a way to mimic that. I had a customer for years where there was no way I was going to pull all of their data down. Besides the fact that it was, you know, private and I didn't want to deal with security, it was also just too much stinking data. So I did pull down the entire structure and then used a couple of tools to generate, you know, mock data.
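The "do it as a select first" habit can also be automated. The sketch below, again using Python's `sqlite3` as a stand-in database, counts the rows a WHERE clause matches before deleting anything, and refuses to proceed if the count is not what you expected. The helper name `guarded_delete` and the `orders` table are made up for illustration.

```python
import sqlite3


def guarded_delete(conn, table, where_sql, params, expected_rows):
    """Preview a DELETE with SELECT COUNT(*), then delete inside a transaction.

    Returns False, leaving the data untouched, if the preview or the delete
    touches a different number of rows than expected.
    """
    # Table and WHERE text are interpolated, so they must come from trusted
    # code, never from user input; the values go through bound parameters.
    count_sql = f"SELECT COUNT(*) FROM {table} WHERE {where_sql}"
    preview = conn.execute(count_sql, params).fetchone()[0]
    if preview != expected_rows:
        return False  # the select did not look right, so nothing is deleted

    cur = conn.execute(f"DELETE FROM {table} WHERE {where_sql}", params)
    if cur.rowcount != expected_rows:
        conn.rollback()
        return False
    conn.commit()
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)",
        [("open",), ("open",), ("cancelled",)],
    )
    conn.commit()
    print(guarded_delete(conn, "orders", "status = ?", ("cancelled",), 1))
    print(guarded_delete(conn, "orders", "status = ?", ("open",), 1))
```

Because the WHERE clause is a required argument, there is no way to call this helper without one, which is the same protection as typing the WHERE clause first.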
That kind of stuff will help you immensely, and you can always pull down some specific examples as well to just test your stuff beforehand. Moving on, I'm going to fly through this one because this is actually some stuff that we really just touched on. It's sort of like AI was, Michael's not looking, but it was like, you know, thinking the same thing. So, what should happen: developer disaster recovery 101. Break it down like a postmortem, focusing on how to build better recovery habits. Prevention: backups, feature flags, only access in prod. Detection: monitoring and alerting, for example Sentry, Datadog, Prometheus, blah blah. There are so many tools out there. Depending on where you're at and what your environment is, there are monitoring and logging tools that can help you out. Uh, there are even some things that will give you, like, warnings of, hey, you are doing something that is going to affect more than five rows. Are you really, really sure you want to do that? Uh, monitoring and alerting is great. Uh, rollbacks we've talked about: canary deployments, hot fix versus proper patching. Uh, this again is the kind of thing where you're going to go test it out before you actually do it. Uh, communication: have an on-call policy, transparent incident updates. This is important. I think even if you're on a small team, you should be regularly letting people know when you screwed up, because I have a team that has got some people that are newer developers. Uh, a lot of times we're new to an environment, things like that. It is very helpful to have in our standups mentions of, by the way, I did this and it turned into something that I did not expect, because that helps other people know what the relationships are in your data, what some of the, you know, potholes are, and things like that. Uh, documentation: runbooks, disaster recovery plans, and retrospectives.
Uh, I highly recommend using automation of any sort anywhere that you can, including runbooks, including even the things we've talked about before like Ant, shell scripts, you know, depending on what your environment is, things like Maven, uh, even CI/CD tools like your Jenkins and things like that, pipelines, all those, even platform as code, those kinds of things. The more you can automate it, the better. That is one of the reasons I've become more and more a fan of using things like Docker, uh, Docker Desktop, for current environments, because then you can just, like, code the whole thing out, and then if you need to replicate an environment, bam, just run it and you're off and going, and you've got a lot of other containers like that. Now let's dive into the next section, which is act three: tools and techniques for DR readiness. I think we're going to have some good conversation here. Share tools and frameworks. Infrastructure: Terraform, Ansible, Kubernetes, Helm charts. Uh, CI/CD: GitHub Actions, GitLab, Jenkins. Backups: automated snapshots, DR regions. Testing: chaos engineering like Gremlin and Netflix's Chaos Monkey, load testing with k6 and Locust. There are a lot of tools that I just mentioned in that little list that I don't think we use enough unless we use them at work.
We'll use them if we're working for a company or an employer, and they're the kinds of things where sometimes they require some time and some money and things like that to set up, and we don't necessarily have it. But I think you can make that part of your technical roadmap as you grow out your side hustle. These are things that you definitely need, obviously. But I'm going to start with backups, like automating snapshots. If you're using cloud, I beg you to make sure that once you put up that server, even if it's, you know, Bob's pet store, once you set up that server for him, take a snapshot and then just hold on to that, because from that snapshot you can generate another instance almost instantaneously. I would actually say take that and share it out to another zone. So, for example, if they're on an East Coast zone and it goes down, have it out there in the West Coast zone. So you can spin something up very quickly. This is called disaster recovery. It's at a very low level, and you don't have to spend much time or money to do so. Once you do it, I think you will realize how simple and addictive it is. That's where I'm going to stop because I know Michael has dealt with some of these things as well. And I'm curious where you want to go on this long list I just provided.
>> So I'm actually going to just kind of keep it simple. So from a developer perspective, if a lot of these terms are new to you, or you've touched on one thing or another but not all of it, start playing with containers. If you are not using containers today, start now. I've basically beaten the horse dead with the kitchen sink idea. But you can build a kitchen sink application or model using containers and these frameworks, like load balancers, uh, like disaster recovery, spinning up production environments.
Essentially, anything that you do, throw it in a container, replicate it, make sure it can be replicated. You can share those containers between developers. You can then take that container, literally take that image, push it up to another environment and see if it scales. If it doesn't, well, make it beefier. See if you can play with it. Containers let you do so much so quickly, and they're really very quick to build and easy to throw away. So, if you mess it up, nope, drop it, rebuild it, start again. These are also very good ways to kind of centralize your development environments. So you can create one development environment, get all your tools in there for what your developers need, and share it out to everyone. They can use that. They break it, drop it, rebuild it. All your code should be in a code repository. So you should be able to pull it up and down. Backups are easy. You just take a snapshot. Done. Literally, in the world of testing, this is what should be happening, and a lot of your big corporate companies are using things like, uh, you know, um, is it Sonar Bayer, uh, TestNG, uh, Cucumber. They use these things, but they also use cloud-based, uh, grid kind of ideas to differentiate and test multiple systems at once. You can run multi-threaded tests against your current systems. Start playing around with that if you're not doing that now. These are things that you should, and really need to, learn and know to be able to succeed in enterprise, and also you want to make sure that the products that you build for your customers can scale, and this is the best way to do it.
>> And it really is. I don't know how often things have gotten lost in configuration. As a developer, over my entire time, I've probably lost a year doing configuration issues and fixing those kinds of things.
I know on my team, on every project there's been at least some point where each developer probably loses a day or two on configuration issues and things like that. Containers can really help you move that kind of stuff forward and homogenize the environment so you don't have to worry about, you know, Bob's got this setup and Al's got this other one and Sam's got a third setup, and then once they commit stuff, it doesn't always work. Those kinds of things can actually get repaired very quickly. And then especially when you get into complicated things where you have to configure an environment, one person can do it and then they share it out, and you don't have to worry about everybody else having to go through that process. Now let's dive into the last section, because we're at time, and just sort of cover this at, you know, a high level. Uh, culture and calm: promote the human side of handling disasters. Blameless postmortems. Stay calm under pressure. Nobody codes well in panic. It's not if, it's when, and how you respond. Now, I just want to say stuff happens. There are going to be, uh, bad queries. There's going to be something that happens and a server goes down. There are going to be times when a drive will fail. Uh, if you go back to our season where we talked about lessons learned from mistakes, I think there were three or four episodes at least that were mistakes where the disaster recovery plan was not in place beforehand. Um, test them, and really this stuff can be so simple at times.
I know, especially from a side hustle, if you're sitting there and you're building an Apache web application and you're just throwing it in this one folder, how hard is it to back up that folder and put it somewhere else, just put it on another machine? If somebody's paying you to develop, you should be able to afford having your own local development environment that is different from production but close enough that you can make it work, or something along those lines. Especially if you're using containers, you should be able to replicate that stuff well enough to be able to do some testing and even do DR testing beforehand. More importantly, shoot us an email at [email protected] because we would love to hear about what it is you're going through. What are some of your, uh, DR disasters? What are some situations where something happened, and also what are some of the things that helped you, where, like, hey, we did this thing that is not the norm but it helped us through a disaster? And then some of the tools that you use, because there are a lot of them out there, and again a lot of them are environment, we'll say, dependent. However, there's a lot out there and it's always good to hear about new ones. We'd love to throw that out to the group and just say, "Hey, by the way, here's something else you guys can check out." That being said, we're going to wrap this one up. As always, you can leave us, obviously, the email, but shoot us something at develpreneur.com. We've got articles, all kinds of stuff there. Leave us a comment, leave us just any kind of review there. Anywhere that you're listening to podcasts, if you're finding a place for podcasts and we're not there, we'll get there. YouTube, at Develpreneur out at that channel, Develpreneur on X, uh, Facebook, we apparently have a page out there, and a lot of other places like that. So let us know anywhere you want to find us if you don't.
We will find a way to get there, um, sooner or later. It may take us a couple of minutes. That being said, go out there and have yourself a great day, a great week, and we will talk to you next time. Now, bonus material. It actually gives us this time bonus segments: disaster snackables, a 60-second real fail from listeners, which, we're not going to reach out to you guys, so you can, you know, just chill out. Coffee cup advice: one actionable DR tip for your next sprint. And so, I'm going to put you on the hot seat. For the next sprint, regardless of where somebody's at, what would be a good DR tip that they can verify they do or add to their application next time around? Well, we kind of touched on it in this one. The one I would say is make sure that your databases are being backed up. If you have a database, make sure that you are using your tools correctly for the different environments. Uh, one thing that we didn't really touch on through this particular episode, uh, because in most modern-day applications we're dealing with cloud-based or software-as-a-service applications, so stuff is in the cloud. However, you will still run into situations where you have customers with machines in the office running their software and services. First and foremost, make sure you have a power backup. I will throw that one out there because that one runs in more often than not. And two, make sure your power cables are nowhere near where your janitor might unplug something to plug in a vacuum cleaner. Not bad. I have seen power issues. I had a customer a while ago where that was regularly the thing. Part of the issue was that what we were dealing with was literally in a closet. Not a data closet, but just, like, a broom closet, and it regularly got turned off, and I would have to remote in. So, um, I want to jump to the database one because we really didn't talk about this.
Not only do you need to back up your database, but restore it. Do an actual test of it. Restore it and make sure you can connect to your database. Uh, I actually recently had an issue where I had lost a bunch of databases because I got a corrupted database server, a lot of them. And when I brought them back, one of the things that I'd forgotten was I had a lot of users that I had to create in order for the applications to deal with the data in this case. Uh, and there are also particular things you're going to run into. I know SQL Server does this a lot. I don't think Oracle does, but to pick on SQL Server, where their IDs are GUIDs, you need to make sure when you bring stuff in that it's not regenerating GUIDs for records, because if your primary keys get broken and you've got related foreign keys, guess what? Those are going to be broken too. So test your disaster recovery. Uh, the simplest thing I would say, uh, for your next sprint: if you are not backing up your source code and your database separately and putting them somewhere that's on a different machine, do that. It's a very easy script to write. An AI engine can write one for you. Pick the language you want it written in, it will do it, and it'll get you something close enough. It'll take you probably 15, 30 minutes tops to have a nice little backup script for your stuff. We have not backed this up, so we're going to go do this right away because we never know what happens to our episodes. I jest a little bit because I don't think I've ever actually lost an episode. I've always hit record in time, and so, knocking on some not actual wood but fake wood, um, hopefully that does not happen this time. We will be back. We're going to continue this. We've got plenty of episodes left. We're, like, I don't know, only halfway through the season or something like that.
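To make "restore it, do an actual test of it" concrete, here is a small sketch using Python's built-in `sqlite3` module as a stand-in (the episode talks about SQL Server, which works differently in detail). It backs up a database to a file, opens that file as if restoring it, and checks that the primary keys, here string GUIDs, came back identical rather than regenerated. The function names, table name, and file name are hypothetical.

```python
import os
import sqlite3
import tempfile
import uuid


def backup_database(source: sqlite3.Connection, backup_path: str) -> None:
    """Copy the live database into a separate file using SQLite's backup API."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()


def verify_restore(source: sqlite3.Connection, backup_path: str,
                   table: str, key: str) -> bool:
    """Open the backup as if restoring it and compare primary keys with the source."""
    restored = sqlite3.connect(backup_path)
    try:
        # Table and column names are interpolated; use trusted values only.
        query = f"SELECT {key} FROM {table} ORDER BY {key}"
        original_keys = source.execute(query).fetchall()
        restored_keys = restored.execute(query).fetchall()
    finally:
        restored.close()
    return original_keys == restored_keys


if __name__ == "__main__":
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
    live.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(str(uuid.uuid4()), "Bob"), (str(uuid.uuid4()), "Sam")],
    )
    live.commit()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "customers-backup.db")
        backup_database(live, path)
        print("restore verified:", verify_restore(live, path, "customers", "id"))
```

A fuller check would also confirm that application users and permissions exist after the restore, which is the piece the speaker says he forgot.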
So we've got a lot more artificial intelligence ahead and all of the shenanigans that it causes. So go out there and have yourself a good one. Thanks for watching. We will talk to you next time. [Music]
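The "very easy script to write" for backing up a source folder might look something like the sketch below. It zips a folder into a separate destination directory under a timestamped name, using only the Python standard library. The function name `backup_folder` and the paths are placeholders; in practice the destination would be a mounted drive or synced location on a different machine, and it should sit outside the folder being archived.

```python
import shutil
import tempfile
import time
import zipfile
from pathlib import Path


def backup_folder(source_dir: str, dest_dir: str) -> Path:
    """Zip source_dir into dest_dir under a timestamped name; return the archive path."""
    source = Path(source_dir).resolve()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base_name = dest / f"{source.name}-{stamp}"
    # make_archive appends ".zip" itself and returns the final file name.
    archive = shutil.make_archive(str(base_name), "zip", root_dir=str(source))
    return Path(archive)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as site, tempfile.TemporaryDirectory() as backups:
        (Path(site) / "index.html").write_text("<h1>hello</h1>")
        archive = backup_folder(site, backups)
        with zipfile.ZipFile(archive) as zf:
            print(archive.name, zf.namelist())
```

Scheduling this with cron or a CI job, and occasionally unzipping an archive to confirm it opens, would cover the "test your disaster recovery" part.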
Transcript Segments
[Music]
That was a recording in progress. All
right, this was going to be fun. Uh,
diving right in. We were just whining
before this, so we're trying to adjust
here and get back to being normal people
and not the whining little pansies that
we are. And yes, we were whining about
work. So, welcome to our world is the
same as your world probably.
So, uh this time we're going to
continue. This is the when coffee hits
the fan real talk on developer disaster
recovery.
Um, this is actually interesting because
I already threw it in there and the
first thing came back is that's a
brilliant and catchy title which I think
was generated by itself. So, it's like
hiding itself on the back while it's
like, "Wow, that's awesome. I'm glad you
sent me one of those or you used our
title." All right. Well, let's dive into
it and see how it goes with a little
three, two,
one. Well, hello and welcome back. We
are continuing our season where it is
with AI. Yes, we are building better
developers the developer podcast. This
season we are going back through prior
season going through some of the
episodes and basically shoving those
into an AI engine, uh, specifically
ChatGPT, and seeing what it gives us back as
its recommendations, just to
sort of see how AI does stuff. Before
that I need to talk about how I do
stuff. Who am I? I am Rob Broadhead. I'm
one of the founders of
Develpreneur, also a founder of, actually the
founder of RB consulting where we are
a we are you know it's a lot of
different ways to refer to it sometimes
we're called a fractional CTO CIO
sometimes we'll be referred to as boutique
consulting the bottom line is we sit
down with our customers we work through
their business talk about what are they
doing what are their processes what are
their goals and their vision and then we
look at what kind of technologies do
they have, what kind of technologies are out
there for them. And we will use a whole
bunch of different tools, whether it's
simplification, integration, automation,
innovation, any of those things and more
we will use to help them craft a
essentially a recipe for success that is
unique to that company because
everybody's got a little bit of a, you
know, different take on stuff. You got
different staff, different needs,
different resources. We help mirror
those or actually marry those things
together, provide a technical road map
and then can help you implement it or
can let you go on and just on your merry
way following that road map. Good thing
bad thing uh very much near and dear to
my heart right now. Good thing is we are
going through and updating the town home
that we just recently got. We got a
whole bunch of little things too. So
that's great. That's awesome. And I've
got a wife that does that stuff. So
she's off doing all that cool things.
Bad thing is is that sometimes she can't
get some of those things done because it
needs as she calls it somebody with
muscles. And yes, I happen to have a
couple. So, I had to actually come to
the townhouse, which we don't have,
which is really the bad thing. We don't
really have Wi-Fi here. So, I'm working
off my phone. So, I may not have the
best connectivity this time around,
which is okay because at some point that
means I'll just block
have to see him. But you will if you're
on the YouTube. but more importantly
let him introduce him hopefully dig
himself out of the hole that I just
created for him. Apologies.
>> Hey everyone, my name is Michael Meloche.
I'm one of the co-founders of
Develpreneur. I'm also the founder and
owner of Envision QA where we help
startups and growing companies launch
better products faster. That means fewer
bugs, smoother customer experience, and
less wasted time and money. We take care
of the behind-the-scenes quality work so
teams can focus on building and scaling.
Learn more at envisionqa.com.
We also offer other things like uh
software assessments, technical
assessments, and really we can help you
understand what your software stack is
or what technologies you have if you
have no idea. So we can also help you
help yourself and improve your products.
Good thing, bad thing. good thing. Uh
had a nice restful weekend. Fourth of
July was a lot of fun. Got to catch up
with some old friends and uh see some
people we haven't seen in a while.
Weather was absolutely great. Uh it
rained a little bit, but I have to say
that was probably the first week we've
had less rain and more sun, so that was
a lot of fun. uh bad side of that. Uh
we were at the riverhouse and of course
as we're going through all the checklist
uh we had to get some repairs done as
well. So we had to replace a couple
toilets and uh garbage disposal went out
and then of course some plumbing issues.
So uh good with the bad.
>> Well, we have nothing but good ahead of
us today. We're going to take the
episode that was called was I guess is
originally was called when coffee hits
the fan real talk on developer disaster
recovery and so we have thrown this into
ChatGPT and it comes back with that's a
brilliant and catchy title because it
likes us. It was going like neutral for
a little bit. It's come back to I think
it heard me talking about it. It's
probably out there like sourcing data
and it's like, ooh, we're being, you
know, we're getting shade thrown at us.
So, let's come back to this like little
bubbly thing that it likes to do. So,
let's dive right into this one. This
one's interesting. So, they add a cold
open episode structure. Cold open, one
to two minutes. Describe a humorous or
dramatic coffee spill scenario. For
example, you just push to production. CI
passed. You sip your coffee and then you
see the site is down completely. Cue theme
music. That would be a really dramatic
entrance into our podcast that I have
never actually done. I don't think so.
Interesting idea. Maybe in a future
season we will adjust how we come into
these things. The best we get is the
like the Christmas and Thanksgiving deal
music that we get in New Year's I guess
when we do those. Not a whole lot there.
So, all right. We missed that. All
right, we're going to go into act one,
real developer disasters, 8 to 10
minutes. Let's see if we can keep it to
that. Hook with real world stories or
invite a guest to share their worst oops
moment. Now, this is interesting because
this is actually giving me a very
different uh answer than we've gotten in
the past couple of time, well actually
now you know dozen or so times that
we've done this. So, hook with real
world stories or invite a guest,
including accidentally wiped out a
production database, hot fix that broke
everything, misconfigured cloud
resources, uh, which equals $10,000
overnight, copied pasted secret keys
into a public repository, actually
spilling on a keyboard mid deploy. It is
amazing
how many of those things I think I think
there's one maybe that I haven't had.
Don't think I've actually spilled coffee
on a keyboard mid deploy. I don't know
that I've ever actually because I don't
drink coffee that I've actually spilled
it. Uh I think I've spilled stuff on
keyboards and lost keyboards but deploy.
So I have like one of the five that they
give us that has not happened.
Misconfigured cloud resources. I'm just
going to jump on that one real quick
because that's like it's not necessarily
a disaster per se, but it can be a real
pretty darn close to it. And it's
actually I think it's more common. I've
seen it a lot of times. It's probably
more common than some of these other
we'll call them, you know, in quotes
disasters.
And it really comes down to, and I I
hate to throw them under the bus, but
you know, Amazon, Google, Microsoft,
when you build with their tools, with
their cloud tools, when you build
environments, they build them, which is
I'm going to give them, you know, a
little bit of grace and and forgiveness
here because they build them to be
essentially an enterprise level
solution. The problem that we have a lot
of times particularly as side hustlers
and developers and things like that is
we don't need something of that level.
So for example
I used and this goes back to when we're
doing our uh our learn to you know learn
your uh learn to develop an internet
launch your internet business. That's
it. Uh we're doing that. One of the
things we do is we set up we use Amazon
we do an EC2 instance and we do
WordPress. Well, at one point I decided
to use the Amazon tools to do it to make
it easier. And it set up a web pre a
WordPress environment, but it included
um it went out and did route 53 for SSL.
It had a secure C, which was I didn't
even need, but it was like, you know,
that's 100 bucks right there. Um it it
spun up like three different servers. I
think there was like actually there was
two front-end web servers. There was one
back-end database and they were all
connect an RDS database and then there
was a load balancer and then there was
like some extra a couple of extra file
servers out there just for media and
stuff like that which if you've got a
nice big WordPress installation awesome
the like daily cost of that thing that
it generated for me was I think wasn't
huge but it was like 50 or 60 bucks
maybe a hundred bucks a day. If you're
blogging and you're just starting out,
you're not going to generate a hundred
bucks a day to support your, you know,
your and we were sitting there launch
your internet business on, you know,
basically less than 50 bucks. So, that
would have completely blown it all up.
So, we immediately said, "No, that is
not an option that we're going to
suggest to people." And I don't want to
throw Amazon under the bus because I
have done the same thing with Azure,
with um or with Google. It's like they
they what they build makes sense, but it
also doesn't often make sense for you if
you're just like, you know, playing
around with it, doing some development,
some testing and things like that. So,
minor disasters, but they can cost you.
And I have more than a couple times
spent a little more money than I wanted
to on uh places like especially like
cloud providers and things like that.
I will now toss it to you because I'm
probably almost burned through all of
our 8 to 10 minutes already.
>> Yeah. So on this one, I'm going to take
the database one because
so many times that has bitten me in the
butt because it's like okay, you're
working and you may think you're in
production or you may think you're not
in production, you're on a different
machine depending upon how you interface
with the database. If you're like
logging in through command line
interface,
they all look the same unless you go do
some custom coloring to make sure that
the system you're connecting to is the
color for what you're on. You know,
green is good, yellow is warn, you know,
like QA and red is prod. Make sure you
don't break anything.
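
A minimal sketch of the color-coding idea in Python. The environment names, the color mapping, and the `env_banner` helper are assumptions made for illustration; real database tools and terminals each have their own settings for this.

```python
# Hypothetical environment names and colors; adjust to your own setup.
ANSI_BACKGROUNDS = {
    "dev": "\033[42m",   # green background: safe to experiment
    "qa": "\033[43m",    # yellow background: be careful
    "prod": "\033[41m",  # red background: stop and think
}
RESET = "\033[0m"


def env_banner(env: str) -> str:
    """Return a colored banner string for the given environment name."""
    # Unknown names fall back to the prod color, so the default is the loud one.
    color = ANSI_BACKGROUNDS.get(env.lower(), ANSI_BACKGROUNDS["prod"])
    return f"{color} {env.upper()} {RESET}"


if __name__ == "__main__":
    for name in ("dev", "qa", "prod"):
        print(env_banner(name))
```

Printing the banner at the top of a session, or putting it in a shell prompt, is one way to make the current system obvious before typing anything destructive.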
I'm just going to go a little kind of
quickly through this one. So with this
I will start with kind of the what to
what you should be doing first. We time
wise you don't always have this but
always back up the system that you're
working on. Make sure you have data
backups or system backups. Uh make sure
that the tools that you're using has
those color features that I was just
mentioning. Also make sure that if you
are using tools, make sure you turn off
autocommit for any environment other
than dev.
If you make a mistake, you can easily
recover if autocommit is off. It may
take longer but it's can save your butt
so many times uh it won't even The other
thing is
uh
again make sure that when you are
working or you're making the changes to
production make sure that you've tested
these scripts out on lower environments
first test them on a dev environment
test them on a QA environment test them
on more than one environment to make
sure that they work the way you expect
them to
Maybe write up some testing depending
upon what language you have to make sure
that hey you've got the data there. Uh
maybe write some, um, stored procedures to
do some, uh, testing for you. Sometimes
even having stored procedures doing SQL, uh,
updates for you is a better idea because
you can establish rules that if failed
you can run everything as a single
transaction and roll it all back. uh or
you could once it's finished you can do
some checks in there and if those checks
pass cool. If those checks fail it could
also roll back. So you can kind of build
in some disaster recovery with this. But
I I just have to say more times than not
you're going to be in a hurry. And all I
have to say is before you do delete,
update, you make a change, make sure you
know what system you're on because you
can make a mistake and be hurting
tomorrow.
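
The "single transaction, check, then roll back" idea can be sketched in Python with the standard `sqlite3` module standing in for a production database. The `products` table, the `raise_prices` helper, and the row-count check are all hypothetical; a stored procedure on a real server would follow the same shape.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products "
        "(id INTEGER PRIMARY KEY, category TEXT, price_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products (category, price_cents) VALUES (?, ?)",
        [("coffee", 1000), ("coffee", 2000), ("tea", 500)],
    )
    conn.commit()
    return conn


def raise_prices(conn, category, cents, max_expected_rows):
    """Apply an UPDATE as one transaction; commit only if the check passes."""
    try:
        cur = conn.execute(
            "UPDATE products SET price_cents = price_cents + ? WHERE category = ?",
            (cents, category),
        )
        # Post-change check: did we touch more rows than we planned to?
        if cur.rowcount > max_expected_rows:
            raise RuntimeError(f"touched {cur.rowcount} rows")
        conn.commit()
        return True
    except Exception:
        conn.rollback()  # undo everything since the last commit
        return False


def total_cents(conn):
    """Sum of all prices, used here only to show what changed."""
    return conn.execute("SELECT SUM(price_cents) FROM products").fetchone()[0]


if __name__ == "__main__":
    db = make_demo_db()
    print(raise_prices(db, "coffee", 100, max_expected_rows=1), total_cents(db))
    print(raise_prices(db, "coffee", 100, max_expected_rows=5), total_cents(db))
```

Nothing is permanent until `commit()` is called, which is the same safety net that turning autocommit off gives you in a database tool.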
Other thing is back up the even if you
don't have a system backup, back up the
table that you're working on. So
hopefully you at least have a snapshot
and you can fix the data quickly. Be
warned, triggers and things of that
nature will make that harder. it's
better to do a full database backup
before you make changes.
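
As a rough illustration of backing up just the table you are about to touch, here is a Python sketch using `sqlite3`. The `snapshot_table` helper and the `_bak` suffix are assumptions, not something named in the episode.

```python
import sqlite3


def snapshot_table(conn, table, suffix="_bak"):
    """Copy every row of `table` into a sibling table and return its name."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    backup = f"{table}{suffix}"
    conn.execute(f"DROP TABLE IF EXISTS {backup}")
    # CREATE TABLE ... AS SELECT copies the data only: no indexes,
    # constraints, or triggers come along with it.
    conn.execute(f"CREATE TABLE {backup} AS SELECT * FROM {table}")
    conn.commit()
    return backup


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers (name) VALUES (?)", [("Bob",), ("Al",)])
    conn.commit()
    backup = snapshot_table(conn, "customers")
    conn.execute("DELETE FROM customers")  # the mistake we want to survive
    conn.commit()
    left = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    saved = conn.execute(f"SELECT COUNT(*) FROM {backup}").fetchone()[0]
    return left, saved


if __name__ == "__main__":
    print(demo())
```

Because only the rows are copied, this is a quick safety net for the data and not a substitute for the full database backup recommended above.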
>> A couple things I want to throw on that
is that um yes, it's you you definitely
want to do it in a, call it, lower-level
database as opposed to production. Um
one of the things I was taught years ago
as a DBA is that whatever you're going
to do, if you're going to do an update,
a delete, or anything like that that's
going to change stuff, even inserts, uh
depending on how you're doing the
insert, is do it as a select first. I
know this is a little geeky database
stuff, but like if you're going to do
delete blank from blank, do the select
first and take a look at that and make
sure that your row count looks like it
should. Um, if you are typing stuff,
even though like always turn your auto
commit off. That is a very important
thing. U make sure if you're writing
something that the first thing you do is
your WHERE clause. I have found too many
times that I've tried to do something,
I've been doing something and then you
accidentally like flub a key and then
you have to, you know, hopefully you
just have to roll it back. Uh definitely
use color coding wherever possible. Even
if I'm using command line shells, I will
use stuff that I have. Uh granted, I'm a
Mac person, but I know you can do this
with other systems as well. You can
change the background of the command
line environment that you're in. So if
you telnet into something, you can actually
change that to make sure that you're in
the proper shell. And as Michael said,
typically, you know, green is, you know,
development, yellow for test and red for
production. And please don't, you know,
mess with that kind of stuff. Um, also I
will just say do not use root anywhere
ever if you can possibly avoid it. I
know there's some places you can't. uh
in those cases don't be afraid to use
things like Docker and other you know
and things like that and ways to and
even AI to generate essentially your own
development environment even when you're
dealing with production environments
that are just too deal find a way to
mimic that I have I had a customer for
years that was uh there was no way I was
going to pull all of their data down
besides the fact that it was like you
know private and I didn't want to deal
with security it was also just too much
stinking data
So I did pull down the entire structure
and then used a couple of tools to
generate you know mock data. That kind
of stuff will help you immensely and you
can always pull down some specific
examples as well to just test your stuff
beforehand. Moving on because this is
actually I'm going to fly through this
one because this actually some stuff
that we really just touched on sort of
like AI was Michael's not looking but it
was like you know thinking the same
thing. So what should happen? Developer
disaster recovery 101: break it down
like a postmortem, focusing on how to
build better recovery habits.
Prevention: backups, feature flags, read-only
access in prod. Detection: monitoring and
alerting, for example Sentry, Datadog,
Prometheus, blah blah. There's so many
tools out there depending on where
you're at what your environment is there
are monitoring and logging tools that
can help you out. Uh there's even some
things that will give you like warnings
of hey you are doing something that is
going to affect more than five rows. Are
you really really sure you want to do
that? Uh monitoring alerting is great.
Uh rollbacks we've talked about canary
deployments hot fix versus proper
patching. Uh this is again is the kinds
of things is where you're going to go
test it out before you actually do it.
Uh communication have an on call policy
transparent incident update. This is
important. I think even if you're in a
small team, you should be regularly
letting people know when you screwed up
because I like I have a team that has
got some people that are newer
developers. Uh a lot of times we're new
into an environment, things like that.
It is very helpful to have in our
standups mentions of by the way I did
this and it turned out into something
that I did not expect because that helps
other people know that what the
relationships are in your data what some
of the you know the potholes are and
things like that. Uh runbooks
documentation runbooks disaster recovery
plans and retrospectives. Uh I highly
recommend anywhere that you can use
automation of any sort including
runbooks including even the things we've
talked about before like Ant, shell
scripts you know if you're depending
what your environment is things like
Maven uh even like continuous you know
CI/CD tools like your Jenkins and things
like that pipelines all those even a
platform as code, those kinds of
things the more you can automate it the
better that is one of the reasons I've
become more and more a fan of using
things like Docker uh Docker Desktop for
current environments because then you
can just like code the whole thing out
and then if you need to replicate an
environment, bam, just run it and you're
off and going and you got a lot of other
containers like that. Now dive into the
next section which is act three tools
and techniques for DR readiness. This is
where I want us I think we're going to
have some good conversation here. Share
tools and frameworks. Infrastructure:
Terraform, Ansible, Kubernetes,
Helm charts. Uh, CI/CD: GitHub Actions,
GitLab, Jenkins. Backups: automated
snapshots, DR regions. Testing: chaos
engineering like Gremlin, Netflix's
Chaos Monkey. Load testing: k6 and
Locust. There are a lot of tools that
we just mentioned that we I just
mentioned in that little list that I
don't think we use enough unless we are
we'll use them at work. We'll use them
if we're working for a company or an
employer and they're the kinds of things
it's like because sometimes they require
some time and some money and things like
that set up and we don't necessarily
have it but I think if you can make that
part of your technical roadmap as you
grow out your side hustle your these
things that you you definitely need
obviously obviously but I'm going to
start backups like automating snapshots
and and making sure this is if you're
using cloud
I beg you to make sure that once you put
that server, even if it's, you know,
Bob's pet store street, once you set up
that server for him, take a snapshot and
then just hold on to that because what
you can do is from that snapshot, you
can generate another instance almost
instantaneously.
I would actually say take that and share
it out to another zone. So if, for
example, if they're on an East Coast
zone and it goes down, have it out there
in the West Coast zone. So you can spend
something up very quickly. This is
called disaster recovery. It's at a very
low and it's you don't have to spend
much time or money to do so. Once you do
it, I think you will realize how simple
and addictive it is. That's where I'm
going to stop because I know Michael has
dealt with some of these things as well.
And I'm curious where you want to go on
this long list I just provided.
>> So I'm actually going to just kind of
keep it simple. So from a developer
perspective,
if you're new to if a lot of these terms
are new to you and or you've touched on
one thing or another, but not all of it,
start playing with containers. If you
are not using containers today, start
now.
I I I basically beaten the horse dead
with the kitchen sink idea. But you can
build a kitchen sink application or
model using containers and these
frameworks like load balancers, uh like
disaster recovery, spinning up
production environments.
Essentially, anything that you do,
throw it in a container, replicate it,
make sure it can be replicated. You can
share those containers between
developers. You can then take that
container, stick it, literally take that
image, push it up to another environment
and see if it scales. If it doesn't,
well, make it beefier. See if you can
play with it. Containers let you do so
much so quickly and they're really very
quick to build and easy to throw away.
So, if you mess it up, nope, drop it,
rebuild it, start again. These are also
very good ways to kind of centralize
your development environments. So you
can create one development environment,
get all your tools in there for your
developers for what they need, share it
out to everyone. They can use that. They
break it, drop it, rebuild it. All your
code should be in a code repository. So
you should be able to pull it up and
down. Backups are easy. You just take a
snapshot. Done.
Literally with in the world of testing,
this is what should be happening and a
lot of your big corporate companies are
using things like uh you know um is it
Sonar Bayer, uh, TestNG, uh, Cucumber. They
use these things but they also use
cloud-based uh grid kind of ideas to
differentiate and test multiple systems
at once to test multiple you can run
multi-threaded tests against your
current systems.
start playing around with that if you're
not doing that now. These are things
that you should and really need to learn
and know to be able to succeed in
enterprise and also you want to make
sure that your products that you build
for your customers can scale and this is
the best way to do it
>> and it really is it's one of those that
I don't know how often things have
gotten lost in configuration uh there
that as developer spent over the I lost
a year my entire time doing
configuration issues and fixing those
kinds of things. I know my team has
regularly had like every project there's
been at least at some point where each
developer probably loses a day or two on
configuration issues and things like
that. Containers can really help you
move that kind of stuff forward and
homogenize the environment so you don't
have to worry about, you know, Bob's got
this setup and Al's got this other one
and Sam's got a third setup and then
none of it, you know, they and then once
they commit stuff, it doesn't always
work. Those kinds of things can get
actually repaired very quickly. And then
especially when you get into complicated
things where you have to configure an
environment, one person can do it and
then they share it out and you don't
have to worry about everybody else
having to go through that process. Now
dive into the last section because we're
we're time and just sort of cover this
in a, you know, a high level. Uh, culture and
calm: promote the human side of handling
disasters. Blameless postmortems stay
calm under pressure. Nobody codes well
in panic. It's not if it's not if, it's
when and how you respond. Now, I I just
want to say
stuff happens. There is going to be
things where there's going to be uh bad
queries. There's going to be something
happens and a server goes down. There's going to
be things that a drive will fail. Uh if
you go back to our season, we talked
about lessons learned from mistakes. I
think there were three or four episodes
at least that were mistakes where the
disaster recovery plan was not in place
beforehand. U test them and really this
stuff can be so simple at times. I know
especially for a side hustle, if you're sitting
there and you're building an Apache
web application and you're just throwing
in this one folder how hard is it to
back up that folder and put it somewhere
else and just put it on another machine
you can like if you're if somebody's
paying you to develop you should be able
to afford having your own local
development environment that is
different from production that is close
enough that you can make it work or
something along those lines especially
if you're using containers ers, you
should be able to replicate that stuff
good enough to be able to do some
testing and do even DR testing
beforehand.
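
For the "back up that folder and put it somewhere else" point, a small Python sketch using only the standard library is below. The folder names are made up, and the destination here is a local temporary directory; in practice it would be a path on another machine or mounted drive.

```python
import shutil
import tempfile
from pathlib import Path


def backup_folder(source: Path, backup_root: Path, label: str) -> Path:
    """Copy `source` into backup_root/label and return the new path."""
    target = backup_root / label
    # copytree raises if the target already exists, so an older backup
    # is never silently overwritten.
    shutil.copytree(source, target)
    return target


def count_files(folder: Path) -> int:
    """Count regular files under `folder`, including subfolders."""
    return sum(1 for p in folder.rglob("*") if p.is_file())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        site = root / "htdocs"  # hypothetical web application folder
        (site / "css").mkdir(parents=True)
        (site / "index.html").write_text("<h1>hi</h1>")
        (site / "css" / "site.css").write_text("h1 { color: red; }")
        copy = backup_folder(site, root / "backups", "htdocs-copy")
        return count_files(site), count_files(copy)


if __name__ == "__main__":
    print(demo())
```

Comparing the file counts afterwards is a cheap sanity check that the copy is complete.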
More importantly, shoot us an email at
because we would love to hear about what
is it what is it you're going through?
What are some of your uh DR disasters?
What are some situations where something
happened and you guys and also like what
are some of the things that helped you
where like hey we did this thing that is
not the norm but it helped us through a
disaster and then some of the tools that
you use because there are a lot of them
out there and again a lot of them are
environment we'll say dependent however
there's a lot out there and it's always
good to hear about new ones. We'd love
to throw that out to the group and just
say, "Hey, by the way, here's something
else you guys can check out."
That being said, we're going to wrap
this one up. As always, you can leave us
obviously the email, but shoot us
something at develpreneur.com. We've got
anywhere you want to. We've got
articles, all kinds of stuff there.
Leave us a a comment, leave us just any
kind of review there. Anywhere that
you're listening to podcast, if you're
finding a place for podcast and we're
not there, we'll get there. YouTube at
Develpreneur, out at that channel.
Develpreneur on X. Uh, Facebook, we apparently
have a page out there and a lot of other
places like that. So let us know
anywhere you want to find us if you
don't. We will find a way to get there
um sooner or later. It may take us a
couple of minutes. That being said, go
out there and have yourself a great day,
a great week, and we will talk to you
next time. Now, bonus material. It
actually gives us this time. Bonus
segments, and it it's disaster
snackables, a 60-second real fail from
listeners, which we're not going to
reach out to you guys, so you can, you
know, like just chill out. Coffee cup
advice, one actionable DR tip for your
next sprint.
And so, I'm going to put you on the hot
seat. And for the next sprint,
regardless of what where somebody's at,
what would be a good DR tip that they
can verify they do or add to their their
application next time around?
Well, we kind of touched on it in this
one. The the one I would say is make
sure that your databases are being
backed up. If you have a database, make
sure that you are using your tools
correctly for the different
environments. Uh, one thing that we
didn't really touch on through this
particular episode, uh, because in most
modern day applications, we're dealing
with cloud-based or software as a
service application. So, stuff is in the
cloud. However, you will still run into
situations where you have customers with
machines in the office running their
software and services.
First and foremost, make sure you have a
power backup.
I will throw that one out there because
that one runs in more often than not.
And two, make sure your power cables are
nowhere near where your janitor might
unplug something to plug in a vacuum
cleaner.
Not bad. I have seen power issues. I had
a customergo that that was regularly the
thing. The problem was I guess part of
the issue was that we were dealing with
was actually literally in a closet. Not
a data closet, but just like a broom
closet and it regularly got turned off.
and I'd be like and I would have to
remote in. So, um I think from the data
I want to jump to the database one
because we really didn't talk about
this. Not only do you need to back up
your database but restore it, do an
actual test of it. Restore it and make
sure you can connect to your database.
There are a lot of times uh I actually
recently had an issue where I had lost a
bunch of databases because I got a
corrupted server, database server, a lot
of them. And when I brought them back,
one of the things that I'd forgotten was
I had a lot of users that I had to
create in order to actually deal with
those applications. So the applications
could deal with data in this case. Uh
and there's also particular things you're going
to run into if you're in, uh, I
know SQL Server does this a lot. I don't
know. I don't think Oracle does, but to
pick on SQL Server, where their IDs are
GUIDs. And so you need to make sure when
you bring stuff in that it's not
regenerating GUIDs for records because
if your primary keys get broken and
you've got related foreign keys, guess
what? Those are going to be broken too.
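
A restore test along these lines can be sketched in Python with `sqlite3`, which stands in for SQL Server here. The `customers` and `orders` tables, the helper names, and the simulated bad import are all assumptions made for the example.

```python
import sqlite3
import uuid


def restore_copy(source: sqlite3.Connection) -> sqlite3.Connection:
    """Back up `source` into a fresh in-memory database and return it."""
    restored = sqlite3.connect(":memory:")
    source.backup(restored)  # Connection.backup is available in Python 3.7+
    return restored


def verify_restore(source, restored, tables) -> bool:
    """Check row counts match and no foreign key points at a missing row."""
    for table in tables:  # table names are trusted, hard-coded values here
        query = f"SELECT COUNT(*) FROM {table}"
        if source.execute(query).fetchone()[0] != restored.execute(query).fetchone()[0]:
            return False
    # PRAGMA foreign_key_check returns one row per broken reference.
    return restored.execute("PRAGMA foreign_key_check").fetchall() == []


def demo():
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")
    source.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id TEXT REFERENCES customers(id))"
    )
    key = str(uuid.uuid4())
    source.execute("INSERT INTO customers VALUES (?, ?)", (key, "Bob"))
    source.executemany("INSERT INTO orders (customer_id) VALUES (?)", [(key,), (key,)])
    source.commit()

    restored = restore_copy(source)
    good = verify_restore(source, restored, ["customers", "orders"])

    # Simulate a bad import that regenerates the primary key on the way in.
    restored.execute("PRAGMA foreign_keys = OFF")  # make sure the update is allowed
    restored.execute("UPDATE customers SET id = ?", (str(uuid.uuid4()),))
    restored.commit()
    bad = verify_restore(source, restored, ["customers", "orders"])
    return good, bad


if __name__ == "__main__":
    print(demo())
```

The second check fails because the orders still point at the old key, which is exactly the broken-foreign-key situation described above.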
So
test your disaster recovery. Uh the
simplest thing I would say
uh for your next sprint if you are not
backing up your source code and your
database
separately and putting them somewhere
that's on a different machine do that.
It's a very easy script to write. You
could have an AI engine write one
for you. Whatever you want to do it in,
pick the language you want it written
in, it will do it and it'll get you
something close enough, take you
probably 15, 30 minutes tops to have a
nice little backup script for your
stuff.
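
One possible shape for that "nice little backup script", written in Python with only the standard library. It assumes a SQLite database file and a source folder, and every path and name in it is hypothetical; other databases would use their own dump tools, and the destination should live on a different machine.

```python
import shutil
import sqlite3
import tempfile
from datetime import datetime
from pathlib import Path


def backup_source(source_dir: Path, dest_dir: Path, stamp: str) -> Path:
    """Zip the source tree into dest_dir and return the archive path."""
    base = dest_dir / f"source-{stamp}"
    return Path(shutil.make_archive(str(base), "zip", root_dir=str(source_dir)))


def backup_database(db_path: Path, dest_dir: Path, stamp: str) -> Path:
    """Copy a SQLite database into dest_dir using the online backup API."""
    target = dest_dir / f"database-{stamp}.sqlite3"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    return target


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source_dir, dest_dir = root / "app", root / "backups"
        source_dir.mkdir()
        dest_dir.mkdir()  # stand-in for a location on another machine
        (source_dir / "main.py").write_text("print('hello')\n")
        db_path = root / "app.sqlite3"
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE notes (body TEXT)")
        conn.execute("INSERT INTO notes VALUES ('remember to test restores')")
        conn.commit()
        conn.close()

        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        archive = backup_source(source_dir, dest_dir, stamp)
        db_copy = backup_database(db_path, dest_dir, stamp)

        check = sqlite3.connect(db_copy)
        rows = check.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
        check.close()
        return archive.exists(), db_copy.exists(), rows


if __name__ == "__main__":
    print(demo())
```

The source archive and the database copy are written as separate files, so either one can be restored without the other.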
We have not backed this up, so we're
going to go do this right away because
we never know what happens to our our
episodes. I jest a little bit because I
don't think I've ever actually lost an
episode. I've always hit record in time
and so knocking on some not actual wood
but fake wood. Um hopefully that does
not happen this time. We will be back.
We're going to continue this. We've got
plenty of episodes left. We're like I
don't know only halfway through the
season or something like that. So we've
got a lot more artificial intelligence
ahead and all of the shenanigans that it
causes. So go out there and have
yourself a good one. Thanks for
watching. We will talk to you next time.
[Music]