🎙 Develpreneur Podcast Episode

What Happens When Software Fails? Tools and Tactics to Recover Fast

In this episode, Rob and Michael discuss the importance of having a disaster recovery plan in place and the tools and tactics to recover fast. They cover the need for automated backups and snapshots, the benefits of using containers and Docker, and the importance of monitoring and logging tools.

2025-07-12 • Season 25 • Episode 12 • Developer Disaster Recovery • Podcast

Detailed Notes

The episode discusses the importance of having a disaster recovery plan in place, particularly for developers who work on critical systems. The hosts, Rob and Michael, share their experiences with disaster recovery and the tools and tactics they use to recover fast. They emphasize the need for automated backups and snapshots, the benefits of using containers and Docker, the importance of monitoring and logging tools, and the need for communication and transparency in incident response.

Highlights

  • The importance of having a disaster recovery plan in place
  • The need for automated backups and snapshots
  • The benefits of using containers and Docker
  • The importance of monitoring and logging tools
  • The need for communication and transparency in incident response

Key Takeaways

  • Developers need to have a disaster recovery plan in place
  • Automated backups and snapshots are essential
  • Containers and Docker can help with disaster recovery
  • Monitoring and logging tools are critical
  • Communication and transparency are key in incident response

Practical Lessons

  • Use automated backups and snapshots to ensure data is safe
  • Use containers and Docker to make disaster recovery easier
  • Implement monitoring and logging tools to detect issues early
  • Communicate with your team and stakeholders during incidents

Strong Lines

  • The importance of having a disaster recovery plan in place cannot be overstated
  • Automated backups and snapshots are the key to disaster recovery
  • Containers and Docker can make disaster recovery easier and faster

Blog Post Angles

  • The importance of having a disaster recovery plan in place
  • The benefits of using containers and Docker for disaster recovery
  • The role of monitoring and logging tools in disaster recovery

Keywords

  • Disaster recovery
  • Containers
  • Docker
  • Monitoring
  • Logging

Transcript Text

Welcome to Building Better Developers, the Develpreneur podcast, where we work on getting better step by step, professionally and personally. Let's get started.

Well, hello and welcome back. We are continuing our season where we're working with AI. Yes, we are Building Better Developers, the Develpreneur podcast. This season, we're going back through a prior season, going through some of the episodes and basically shoving those into an AI engine, specifically ChatGPT, and seeing what it gives us back as its recommendations, just to sort of see how AI does stuff.

Before that, I need to talk about how I do stuff. Who am I? I am Rob Broadhead. I'm one of the founders of Develpreneur, and also a founder of, actually the founder of, RB Consulting. There are a lot of different ways to refer to it. Sometimes we're called a fractional CTO or CIO. Sometimes we're referred to as boutique consulting. The bottom line is we sit down with our customers. We work through their business and talk about what they are doing. What are their processes? What are their goals and their vision? And we look at what kind of technologies they have and what kind of technologies are out there for them. We will use a whole bunch of different tools, whether it's simplification, integration, automation, innovation, any of those things and more, to help them craft essentially a recipe for success that is unique to that company, because everybody's got a little bit of a different take on stuff. You've got different staff, different needs, different resources. We help mirror, or actually marry, those things together, provide a technical roadmap, and then can help you implement it or let you go on your merry way following that roadmap.

Good thing, bad thing. Very much near and dear to my heart right now. The good thing is we are going through and updating the townhome that we just recently got. We've got a whole bunch of little things. So that's great. That's awesome.
I've got a wife that does that stuff, so she's off doing all those cool things. The bad thing is that sometimes she can't get some of those things done because it needs, as she calls it, somebody with muscles. And yes, I happen to have a couple. So I had to actually come to the townhouse, and the really bad thing is that we don't really have Wi-Fi here. So I'm working off my phone, and I may not have the best connectivity this time around, which is okay because at some point that means I'll just block. Have to see him. But you will if you're on the YouTube. But more importantly, let him introduce himself and dig himself out of the hole that I just created for him. Apologies.

Hey everyone. My name is Michael Meloche. I'm one of the co-founders of Develpreneur. I'm also the founder and owner of Envision QA, where we help startups and growing companies launch better products faster. That means fewer bugs, a smoother customer experience, and less wasted time and money. We take care of the behind-the-scenes quality work so teams can focus on building and scaling. Learn more at EnvisionQA.com. We also offer other things like software assessments and technical assessments, and we can help you understand what your software stack is or what technologies you have if you have no idea. So we can also help you help yourself and improve your products.

Good thing, bad thing. Good thing: had a nice restful weekend. Fourth of July was a lot of fun. Got to catch up with some old friends and see some people we haven't seen in a while. The weather was absolutely great. It rained a little bit, but I have to say that was probably the first week we've had less rain and more sun, so that was a lot of fun. The bad side of that: we were at the river house and, of course, as we were going through all the checklists, we had to get some repairs done as well. So we had to replace a couple of toilets, the garbage disposal went out, and then of course there were some plumbing issues.
So, good with the bad.

Well, we have nothing but good ahead of us today. We're going to take the episode that was, I guess, originally called "When Coffee Hits the Fan: Real Talk on Developer Disaster Recovery." We have thrown this into ChatGPT, and it comes back with "That's a brilliant and catchy title," because it likes us. It was going, like, neutral for a little bit. It's come back; I think it heard me talking about it. It's probably out there, like, sourcing data, and it's like, oh, we're getting shade thrown at us, so let's come back to this little bubbly thing that it likes to do.

So let's dive right into this one. This one's interesting. They add a cold open. Episode structure: cold open, one to two minutes. Describe a humorous or dramatic coffee spill scenario. For example: you just pushed to production, CI passed, you sip your coffee, and then you see the site is down completely. Cue theme music. That would be a really dramatic entrance into our podcast, one that I don't think I have ever actually done. Interesting idea. Maybe in a future season we will adjust how we come into these things. The best we get is, like, the Christmas and Thanksgiving music, and New Year's; I guess we do those. Not a whole lot there. So, all right, we missed that.

But we're going to go into Act one: real developer disasters, eight to ten minutes. Let's see if we can keep it to that. Hook with real-world stories or invite a guest to share your worst oops moment. Now this is interesting, because this is actually giving me a very different answer than we've gotten the past couple of times. Well, actually, the dozen or so times that we've done this. So: hook with real-world stories or invite a guest, including: accidentally wiped out a production database. A hotfix that broke everything. Misconfigured cloud resources, which equals ten thousand dollars overnight. Copy-pasted secret keys into a public repository.
Actually spilling coffee on a keyboard mid-deploy.

It is amazing how many of those things... I think there's one, maybe, that I haven't had. I don't think I've actually spilled coffee on a keyboard mid-deploy. Because I don't drink coffee, I don't know that I've ever actually spilled it. I think I've spilled stuff on keyboards and lost keyboards, but not mid-deploy. So I have, like, one of the five that they give us that has not happened.

Misconfigured cloud resources: I'm just going to jump on that one real quick, because it's not necessarily a disaster per se, but it can be pretty darn close to it. And I think it's actually more common. I've seen it a lot of times. It's probably more common than some of these other, we'll call them in quotes, "disasters." And it really comes down to, and I hate to throw them under the bus, Amazon, Google, Microsoft. When you build environments with their cloud tools, they build them, and I'm going to give them a little bit of grace and forgiveness here, to be essentially an enterprise-level solution. The problem that we have a lot of the time, particularly as side hustlers and developers and things like that, is that we don't need something of that level.

So, for example, this goes back to when we were doing our... Launch Your Internet Business, that's it. When we were doing that, one of the things we did was set up, using Amazon, an EC2 instance, and we did WordPress. Well, at one point I decided to use the Amazon tools to do it, to make it easier. And it set up a WordPress environment, but it included... It went out and did Route 53 for SSL and had a secure cert, which I didn't even need, but it was like, you know, that's 100 bucks right there. It spun up, like, three different servers. I think there were actually two front-end web servers.
There was one back-end database, an RDS database, and they were all connected. And then there was a load balancer, and there were a couple of extra file servers out there just for media and stuff like that, which, if you've got a nice big WordPress installation, awesome. The daily cost of that thing that it generated for me was, I think, not huge, but it was like 50 or 60 bucks, maybe 100 bucks a day. If you're blogging and you're just starting out, you're not going to generate 100 bucks a day to support that, and we were sitting there saying launch your Internet business on, you know, basically less than 50 bucks. So that would have completely blown it all up. So we immediately said no, that is not an option that we're going to suggest to people. And I don't want to throw Amazon under the bus, because I have done the same thing with Azure and with Google. I would say what they build makes sense, but it also often doesn't make sense for you if you're just, you know, playing around with it, doing some development, some testing, and things like that. So, minor disasters, but they can cost you, and I have more than a couple of times spent a little more money than I wanted to on places like, especially, cloud providers. I will now toss it to you, because I've probably almost burned through all of our eight to ten minutes already.

Yeah, so on this one I'm going to take the database one, because so many times that has bitten me in the butt. It's like, okay, you're working, and you may think you're in production, or you may think you're not in production and you're on a different machine, depending upon how you interface with the database. If you're logging in through a command-line interface, they all look the same unless you go do some custom coloring to make sure that the system you're connecting to has the color for what you're on. You know, green is good, yellow is warn, you know, like QA, and red is prod. Make sure you don't break anything.
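As a rough illustration of the color-coding idea just described, here is a minimal Python sketch that tags each environment with an ANSI background color. The environment names (`dev`, `qa`, `prod`) and the `environment_banner` helper are assumptions made for this example; the hosts do not name a specific tool, and most database clients and terminals have their own settings for this.

```python
# Hypothetical mapping from environment name to an ANSI background color.
ANSI_BACKGROUNDS = {
    "dev": "\033[42m",   # green background: safe to experiment
    "qa": "\033[43m",    # yellow background: be careful
    "prod": "\033[41m",  # red background: production
}
ANSI_RESET = "\033[0m"


def environment_banner(env: str) -> str:
    """Return a colored banner for the environment.

    Unknown names fall back to the production color, so a typo
    never makes a risky connection look safe.
    """
    color = ANSI_BACKGROUNDS.get(env.lower(), ANSI_BACKGROUNDS["prod"])
    return f"{color} {env.upper()} {ANSI_RESET}"


if __name__ == "__main__":
    for name in ("dev", "qa", "prod"):
        print(environment_banner(name), "connected")
```

Treating unknown environments as production is a deliberate choice in this sketch: the cautious color is the default, and the friendly one has to be earned.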
I'm just going to go kind of quickly through this one. So with this, I will start with what you should be doing first. Time-wise you don't always have this, but always back up the system that you're working on. Make sure you have data backups or system backups. Make sure that the tools that you're using have those color features that I was just mentioning. Also, if you are using tools, make sure you turn off auto-commit for any environment other than dev. If you make a mistake, you can easily recover if auto-commit is off. It may take longer, but it can save your butt so many times.

The other thing is, again, when you are making changes to production, make sure that you've tested these scripts out on lower environments first. Test them on a dev environment, test them on a QA environment, test them on more than one environment to make sure that they work the way you expect them to work. Maybe write up some testing, depending upon what language you have, to make sure that, hey, you've got the data there. Maybe write some stored procedures to do some testing for you. Sometimes even having stored procedures do SQL updates for you is a better idea, because you can establish rules so that you run everything as a single transaction and, if something fails, roll it all back. Or once it's finished, you can do some checks in there, and if those checks pass, cool. If those checks fail, it can also roll back. So you can kind of build in some disaster recovery with this.

But I just have to say, more times than not you're going to be in a hurry, and all I have to say is: before you do a delete, an update, before you make a change, make sure you know what system you're on, because you can make a mistake and be hurting tomorrow.
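The "run it as a single transaction, check it, and roll back if the checks fail" idea can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory database rather than stored procedures, and the `customers` table, `backup_table`, and `guarded_change` names are hypothetical.

```python
import sqlite3


def backup_table(conn: sqlite3.Connection, table: str) -> str:
    """Copy a table's rows into <table>_backup (rows only, no indexes or triggers)."""
    backup = f"{table}_backup"
    # Table names cannot be bound as parameters, so only pass trusted names here.
    conn.execute(f"DROP TABLE IF EXISTS {backup}")
    conn.execute(f"CREATE TABLE {backup} AS SELECT * FROM {table}")
    conn.commit()
    return backup


def guarded_change(conn: sqlite3.Connection, sql: str, params: tuple,
                   expected_rows: int) -> bool:
    """Run one UPDATE/DELETE and commit only if it touched the expected row count."""
    cur = conn.execute(sql, params)  # sqlite3 opens a transaction implicitly here
    if cur.rowcount == expected_rows:
        conn.commit()
        return True
    conn.rollback()  # the check failed, so undo the change
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO customers (status) VALUES (?)",
                 [("active",), ("active",), ("closed",)])
conn.commit()

backup_table(conn, "customers")

# We meant to close one account, but this WHERE clause matches two rows,
# so the check fails and the update is rolled back.
ok = guarded_change(conn,
                    "UPDATE customers SET status = 'closed' WHERE status = ?",
                    ("active",), expected_rows=1)
print("committed" if ok else "rolled back")
```

A real check would usually be richer than a row count (for example, totals that must still balance), but the shape is the same: change, verify, then commit or roll back.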
The other thing is back up. Even if you don't have a system backup, back up the table that you're working on, so hopefully you at least have a snapshot and you can fix the data quickly. Be warned: triggers and things of that nature will make that harder. It's better to do a full database backup before you make changes.

A couple of things I want to throw on that. Yes, you definitely want to do it in a lower-level database as opposed to production. One of the things I was taught years ago as a DBA is that whatever you're going to do, if you're going to do an update, a delete, or anything like that that's going to change stuff, even inserts, depending on how you're doing the insert, do it as a select first. I know this is a little geeky database stuff, but if you're going to do "delete blank from blank," do the select first, take a look at that, and make sure that your row count looks like it should. If you are typing stuff, always turn your auto-commit off; that is a very important thing. And make sure, if you're writing something, that the first thing you write is your where clause. I have found too many times that I've been doing something and then you accidentally flub a key, and then, you know, hopefully you just have to roll it back.

Definitely use color coding wherever possible. Even if I'm using command-line shells, I will use it. Granted, I'm a Mac person, but I know you can do this with other systems as well. You can change the background of the command-line environment that you're in, so you can make sure that you're in the proper shell. And as Michael said, typically, you know, green is development, yellow is for test, and red is for production. And please don't mess with that kind of stuff. Also, I will just say, do not use root anywhere, ever, if you can possibly avoid it. I know there are some places where you can't.
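The "do it as a select first" habit described above can be sketched like this. It is a minimal example, assuming Python's built-in `sqlite3` module and an in-memory database; the `orders` table and the `preview_then_delete` helper are hypothetical names used only for illustration.

```python
import sqlite3


def preview_then_delete(conn: sqlite3.Connection, table: str, where: str,
                        params: tuple, expected_rows: int):
    """Run the WHERE clause as a SELECT first; delete only if the row count matches."""
    # The table name and WHERE text are interpolated, so only pass trusted
    # strings; the values themselves are bound through params.
    matched = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where}", params
    ).fetchone()[0]
    if matched != expected_rows:
        return matched, False  # the row count looks wrong: change nothing
    conn.execute(f"DELETE FROM {table} WHERE {where}", params)
    conn.commit()
    return matched, True


orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT)")
orders.executemany("INSERT INTO orders (state) VALUES (?)",
                   [("stale",), ("stale",), ("open",), ("open",), ("open",)])
orders.commit()

# Expecting 2 stale rows: the preview agrees, so the delete runs.
print(preview_then_delete(orders, "orders", "state = ?", ("stale",), expected_rows=2))

# Expecting 1 open row, but 3 match: nothing is deleted.
print(preview_then_delete(orders, "orders", "state = ?", ("open",), expected_rows=1))
```

Because the same WHERE clause feeds both the preview and the delete, the statement you looked at is the statement that runs, which is the point of writing the where clause first.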
Don't be afraid to use things like Docker and other things like that, and even AI, to generate essentially your own development environment, even if you're dealing with production environments that are just too much to deal with. Find a way to make that. I had a customer for years where there was no way I was going to pull all of their data down. Besides the fact that it was private and I didn't want to deal with security, it was also just too much stinking data. So I did pull down the entire structure and then used a couple of tools to generate mock data. That kind of stuff will help you immensely. And you can always pull down some specific examples as well, just to test your stuff beforehand.

Moving on. I'm going to fly through this one because this is actually some stuff that we really just touched on. It's sort of like AI was, Michael's not looking, but it was, you know, thinking the same thing. So: what should happen? Act two: disaster recovery 101. Break it down like a postmortem, focusing on how to build better recovery habits.

Prevention: backups, feature flags, read-only access in prod. Detection: monitoring and alerting, for example, Sentry, Datadog, Prometheus, blah, blah. There are so many tools out there depending on where you're at and what your environment is. There are monitoring and logging tools that can help you out. There are even some things that will give you warnings like, hey, you are doing something that is going to affect more than five rows. Are you really, really sure you want to do that? Monitoring and alerting is great.

Rollbacks: we've talked about canary deployments, hotfix versus proper patching. This, again, is the kind of thing where you're going to go test it out before you actually do it.

Communication: have an on-call policy, transparent incident updates. This is important.
I think even if you're on a small team, you should be regularly letting people know when you screwed up. I have a team that has got some people that are newer developers, and a lot of times we're new to an environment, things like that. It is very helpful to have in our stand-ups mentions of, by the way, I did this and it turned into something that I did not expect, because that helps other people know what the relationships are in your data, what some of the potholes are, and things like that.

Documentation: runbooks, disaster recovery plans, and retrospectives. I highly recommend automation of any sort anywhere you can use it, including runbooks, including even the things we've talked about, like Ant and shell scripts. You know, depending on what your environment is, things like Maven, even CI/CD tools like Jenkins and things like that, pipelines, even platform as code, those kinds of things. The more you can automate it, the better. That is one of the reasons I've become more and more a fan of using things like Docker and Docker Desktop for current environments, because then you can just, like, code the whole thing out. And then if you need to replicate the environment, bam, just run it and you're off and going. And you've got a lot of other containers like that.

Now let's dive into the next section, which is Act three: tools and techniques for DR readiness. I think we're going to have some good conversation here. Share tools and frameworks. Infrastructure: Terraform, Ansible, Kubernetes, Helm charts. CI/CD: GitHub Actions, GitLab, Jenkins. Backups: automated snapshots, DR regions. Testing: chaos engineering like Gremlin and Netflix's Chaos Monkey; load testing with k6 and Locust. There are a lot of tools that I just mentioned in that little list that I don't think we use enough, unless we're using them at work.
We'll use them if we're working for a company or an employer. And they are the kinds of things where sometimes they require some time and some money and things like that to set up, and we don't necessarily have it. But I think you can make that part of your technical roadmap as you grow out your side hustle; these are things that you definitely need.

Obviously, well, I'd say obviously, I'm going to start with backups, like automating snapshots. If you're using cloud, I beg you to make sure that once you set up that server, even if it's, you know, Bob's pet store down the street, once you set up that server for him, take a snapshot and then just hold on to that. Because from that snapshot you can generate another instance almost instantaneously. I would actually say take that and share it out to another zone. So, for example, if they're in an East Coast zone and it goes down, have it out there in the West Coast zone so you can spin something up very quickly. This is called disaster recovery. It's at a very low level, and you don't have to spend much time or money to do it. Once you do it, I think you'll realize how simple and addictive it is. That's where I'm going to stop, because I know Michael has dealt with some of these things as well, and I'm curious where you want to go on this long list I just provided.

So I'm actually going to just kind of keep it simple. From a developer's perspective, if a lot of these terms are new to you, or you've touched on one thing or another but not all of it, start playing with containers. If you are not using containers today, start now. I basically have beaten the horse dead with the kitchen sink idea, but you can build a kitchen sink application or model using containers and these frameworks for things like load testing, disaster recovery, spinning up production environments, essentially anything that you do. Throw it in a container, replicate it, make sure it can be replicated.
You can share those containers between developers. You can then take that container, literally take that image, push it up to another environment, and see if it scales. If it doesn't, well, make it beefier. See if you can play with it. Containers let you do so much so quickly, and they're really quick to build and easy to throw away. So if you mess it up, drop it, rebuild it, start again. These are also very good ways to kind of centralize your development environments, so you can create one development environment, get all the tools in there that your developers need, and share that with everyone. They can use that. If they break it, drop it, rebuild it. All your code should be in a code repository, so you should be able to pull it up and down. Backups are easy. Just take a snapshot. Done.

Literally, within the world of testing, this is what should be happening. A lot of your big corporate companies are using things like, you know, Sonar Bear, TestNG, Cucumber. They use these things, but they also use cloud-based grid kinds of ideas to test multiple systems at once. You can run multi-threaded tests against your current systems. Start playing around with it if you're not doing that now. These are things that you really need to learn and know to be able to succeed in enterprise. And also you want to make sure that the products that you build for your customers can scale. And this is the best way to do it.

And it really is. It's one of those where I don't know how often things have gotten lost in configuration. As a developer, over the last year I spent my entire time on configuration issues and fixing those kinds of things. I know with my team, on every project there's been at least some point where each developer probably loses a day or two on configuration issues and things like that.
Containers can really help you move that kind of stuff forward and homogenize the environment so you don't have to worry about, you know, Bob's got this setup, and Al's got this other one, and Sam's got a third setup, and then once they commit stuff it doesn't always work. Those kinds of things can actually get repaired very quickly. And especially when you get into complicated things where you have to configure an environment, one person can do it and then share it out, and you don't have to worry about everybody else having to go through that process.

Now let's dive into the last section, because we're running out of time. I'll just sort of cover this at a high level. Calm: promote the human side of handling the matters. Blameless postmortems. Stay calm under pressure. Nobody codes well in a panic. It's not if, it's when, and how you respond.

Now, I just want to say, stuff happens. There are going to be bad queries. There's going to be something that happens and a server goes down. A drive will fail. If you go back to our season where we talked about lessons learned from mistakes, I think there were three or four episodes at least that were mistakes where the disaster recovery plan was not in place beforehand. Test them. And really, this stuff can be so simple at times. If you're sitting there and you're building an Apache web application and you're just throwing it in this one folder, how hard is it to back up that folder and put it somewhere else, just put it on another machine? If somebody's paying you to develop, you should be able to afford having your own local development environment that is different from production but close enough.
Close enough that you can make it work, or something along those lines. Especially if you're using containers, you should be able to replicate that stuff well enough to be able to do some testing, and even do DR testing, beforehand.

More importantly, shoot us an email at info@develpreneur.com, because we would love to hear about what you're going through. What are some of your DR disasters? What are some situations where something happened to you guys? I'd also like to hear what some of the things are that helped you, where, okay, we did this thing that is not the norm, but it helped us through a disaster. And then some of the tools that you use, because there are a lot of them out there. And again, a lot of them are, we'll say, environment dependent. However, there's a lot out there and it's always good to hear about new ones. We'd love to throw that out to the group and just say, hey, by the way, here's something else you guys can check out.

That being said, I'm going to wrap this one up. As always, obviously there's the email, but you can also shoot us something at develpreneur.com. We've got articles and all kinds of stuff there. Leave us a comment. Leave us any kind of review anywhere that you're listening to podcasts. If you're finding a place for podcasts and we're not there, we'll get there. We're on YouTube at the Develpreneur channel, and at Develpreneur on Facebook; we apparently have a page out there, and lots of other places like that. So let us know anywhere you want to find us. If we're not there, we will find a way to get there sooner or later. It may take us a couple of minutes. That being said, go out there and have yourself a great day, a great week. And we will talk to you next time. Bye.