📺 Develpreneur YouTube Episode

Video + transcript

Software Fire Fighting - Part 2

2022-07-07 • YouTube

Detailed Notes

Welcome! Today we are going to be talking about one of the most talked-about, yet often overlooked, topics in software: technical support. What do we do when software has a problem? In today's discussion, we're going to look at production support. What does it mean to actually support our product?

In past discussions, we've talked about the software development life cycle, the software test life cycle, and different ways to write, debug, and test our code. However, we really haven't talked much about deployment and production support. That is where we're going to focus our discussion today: how do we support our product and our customers?

We're going to look at the different types of problems that can arise, how we can troubleshoot them, and different ways we can fix them. Finally, we're going to look at ways we can prevent these problems in the future.

Technical Support - Fire Fighting Overview:

Production Support
Types of Problems
Troubleshooting a Problem
Ways to Fix the Problem
Prevention
Root Cause Analysis (RCA)
Technical Support - Fire Fighting Issues

Finally, this series comes from our mentoring/mastermind classes. These classes are virtual meetings that focus on how to improve our technical skills and build our businesses. After all, the goals of each member vary. However, this diversity makes for great discussions and a ton of educational value every time we meet. We hope you enjoy viewing this series as much as we enjoy creating it. As always, this may not all be new to you, but we hope it helps you be a better developer.

Other Classes You Might Consider:

Test Project: A Review of Version 2.0
Integrating testing into your development flow
Become a Better Developer
Build A Product Catalog
Launching an Internet Business

Transcript Text
Now, once we have fixed the problem and identified all the particular issues that were out there, we need to go back and do what is called a root cause analysis, or RCA. The primary purpose of the RCA is to analyze a problem or sequence of events in order to identify what happened, why it happened, and what can be done to prevent it from happening again.

Basically, this is like the review process you do after the whole software development cycle: you go back in and review what went well and what didn't go well. Here, though, for a root cause analysis, we need to look at it from an end user's perspective and from a risk perspective. We want to make sure that we've defined the problem correctly, and that we gathered all the information that came out of identifying the particular problem, how we can reproduce it, and what we did to fix it. Once we have all that information, we identify the causal factors: the things that actually triggered what happened.
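The RCA steps just described — define the problem, gather the information, identify causal factors, determine the root cause, and record the recommended and implemented solutions — map naturally onto a simple record. Here is a minimal sketch; the field names and sample values are my own illustration, not a standard RCA format:

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseAnalysis:
    """One record per incident, filled in as the RCA progresses."""
    problem: str                      # what happened, from the user's view
    reproduction: str                 # how the problem was reproduced
    causal_factors: list[str] = field(default_factory=list)
    root_cause: str = ""              # settled once the factors are analyzed
    recommended_solutions: list[str] = field(default_factory=list)
    implemented_solutions: list[str] = field(default_factory=list)

# Hypothetical incident used only to show how the record fills in over time.
rca = RootCauseAnalysis(
    problem="orders rejected after the latest release",
    reproduction="submit an order with a letter in the quantity field",
)
rca.causal_factors.append("alphanumeric field should have been numeric only")
rca.root_cause = "missing input validation on the quantity field"
rca.recommended_solutions.append("add field validation and a regression test")
```

Keeping the recommended and implemented solutions as separate fields makes it easy to spot, during a later review, which recommendations were never actually put in place.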
Now that we have our data in hand, we want to look at all the different possibilities that led us to this problem. Was it an alphanumeric field that really should have been alpha only or numeric only? Did we have bad data? Did we have bad security practices? Were we missing logins, or missing role-based security?
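One of the causal factors above — an alphanumeric field that should have accepted only letters or only digits — can be guarded with a simple validation check. A minimal sketch in Python; the function name and categories are illustrative, not from any particular framework:

```python
import re

# Character-class patterns for each kind of field we want to accept.
PATTERNS = {
    "alpha": r"[A-Za-z]+",          # letters only
    "numeric": r"[0-9]+",           # digits only
    "alphanumeric": r"[A-Za-z0-9]+" # letters and digits
}

def validate_field(value: str, kind: str) -> bool:
    """Return True when the whole value matches the expected character class."""
    return re.fullmatch(PATTERNS[kind], value) is not None

# A field that should have been numeric only now rejects mixed input.
print(validate_field("12345", "numeric"))  # True
print(validate_field("12a45", "numeric"))  # False
```

Rejecting bad data at the boundary like this turns a production incident into an ordinary validation error the user can correct themselves.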
Once we have identified the factors, we can then determine the root cause or causes. Here we can use some of the root cause analysis tools that are on the market, or we can go in and just dig through what happened and what we did to actually fix the problem. Determining the root cause is often a combination of what the developers, the networking teams, and the other groups figured out: how we discovered we had the bug, and then what we did to fix it. Finally, now that we've found the bug, fixed it, and have things stable again, what can we do to prevent this from happening again? What are the recommended and implemented solutions going forward that we can put in place to stop the bleeding and make sure this doesn't happen again?
There are different things you could do here. It could be training. It could be implementing additional tools, adding additional logging and alerts to your system, or adding system monitoring, those agents you can put on your machines. These are just some of the things you can do at the end of the root cause analysis to make sure that everything is cohesive, that you have identified what's going on and what happened, and how you can prevent it from happening again.
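Adding logging and alerts, as suggested above, can start very small. Here is a minimal sketch using Python's standard logging module; the handler that pushes warnings onto an "alert" queue is a stand-in for whatever real alerting pipeline (email, pager, chat webhook) you would wire in:

```python
import logging

# Anything at WARNING or above gets queued for alerting;
# routine INFO messages stay in the log only.
alerts: list[str] = []

class AlertHandler(logging.Handler):
    """Illustrative handler: push qualifying records onto an alert queue."""
    def emit(self, record: logging.LogRecord) -> None:
        alerts.append(f"{record.levelname}: {record.getMessage()}")

logger = logging.getLogger("production_support")
logger.setLevel(logging.INFO)
logger.addHandler(AlertHandler(level=logging.WARNING))

logger.info("order processed")               # logged, below the alert level
logger.warning("retrying payment gateway")   # queued as an alert
logger.error("payment gateway unreachable")  # queued as an alert

print(alerts)
```

The design point is the level split: you log everything you might need for an RCA later, but only surface the problems, so support isn't drowning in routine messages.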
People are going to be asking why this happened. As a developer, or depending on where you're at in the company, if you are constantly asked why something broke, you're going to start getting defensive. With a root cause analysis, you have to try to detach yourself from that. Don't take it personally. If you made a mistake, own up to it: say, hey, I made a mistake in this process, or the code had this particular problem in it, and here's what happened. This is not a blame game. This is basically cause and effect: what caused the issue, what was the effect, and how do we prevent it from happening again? When you're doing a root cause analysis, try to take the emotions out of it. I know it's hard, but RCAs are critical to ensuring you do not have problems in the future, and the only way to do that is to be honest when you do your research, actually go through the analysis, and identify what happened.
Finally, before we leave this page, the last thing I recommend from an RCA: even once you've done the RCA, identified all these things, and the recommended and implemented solutions are in place, I also recommend you implement some type of training practice. Training could be part of the recommendation itself, but if you have production support or customer support, it's still good to do trainings either monthly or, depending on your release cadence, at least once a sprint or once a release cycle. Definitely get your support teams involved so they get trained on what's coming and on the issues that have been found and identified. If you're on the support side of things, I definitely recommend that you do at least a monthly, if not quarterly, review of all your RCAs to check whether you see a bad pattern. Do we see a pattern in our software? Are things breaking? What is going on? That review can also help you identify root causes in organizational problems, strategies, release cycle problems, and development problems. RCAs give us a lot of information, and from that you want to add additional things like training.
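The monthly or quarterly review described above is, at its core, a tally: look across your RCA records for causes that keep recurring. A minimal sketch of that step; the record shape and cause labels are made up for illustration:

```python
from collections import Counter

# Each RCA record is reduced here to its root-cause label; real records
# would also carry dates, tickets, and the recommended fixes.
rcas = [
    {"id": "RCA-101", "cause": "missing input validation"},
    {"id": "RCA-102", "cause": "incomplete deploy"},
    {"id": "RCA-103", "cause": "missing input validation"},
    {"id": "RCA-104", "cause": "missing input validation"},
]

counts = Counter(r["cause"] for r in rcas)
for cause, n in counts.most_common():
    if n > 1:  # any repeated cause is a pattern worth a closer look
        print(f"pattern: {cause} appeared {n} times")
```

Even this crude count answers the review questions above — do we see a bad pattern, are things breaking — with data instead of gut feel.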
Lastly, I want to leave you with this: avoid firefighting. When we get into the routine of software support, production support, and customer support, and we get through fixing problems, we run the risk of slipping into a firefighting mode: every time a customer calls with a potential problem, we go out and quickly fix it. All we're doing is hot fixes; we're not going through a normal software process. Firefighting here is just like the real-world kind, where the assumption is that fires are unpredictable and have to be dealt with immediately. Again: hot fixes. If this tool is used too frequently, we end up in an emergency-action mode all the time, which can lead to poor planning and a lack of organizational support, and is likely to tie up resources that could be used elsewhere.
To keep firefighting to a minimum, a comprehensive disaster plan might be implemented. We can think ahead: if something critical happens, how do we address it as a company? Do we have things like failover if the system goes down? Are we behind a load balancer, so that if one machine goes down we just take it out of the rotation and put another one in? These are things that can be done to help prevent the firefighting mode.
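The failover idea above — drop an unhealthy machine from the pool and route around it — can be sketched as a health-check filter. Everything here is illustrative: the host names are invented, and the `is_healthy` stub stands in for a real probe (e.g. an HTTP health endpoint with a short timeout):

```python
def is_healthy(host: str) -> bool:
    # Stub for illustration: a real check would probe the host and treat
    # any error or timeout as unhealthy. Here we pretend app-2 went down.
    return host != "app-2"

def healthy_pool(hosts: list[str]) -> list[str]:
    """Return only the hosts that pass the health check; the load balancer
    then routes traffic to these and skips the rest."""
    return [h for h in hosts if is_healthy(h)]

print(healthy_pool(["app-1", "app-2", "app-3"]))  # ['app-1', 'app-3']
```

Real load balancers run exactly this loop continuously, which is what turns "a machine died" from a 2 a.m. fire into a routine event.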
However, I've actually been in the situation where firefighting unfortunately is a necessary mode to go into for a while, when you're dealing with some critical issues. For instance, you might be dealing with a legacy application that is so old you have no one left in the organization to support it: you have very high technical debt, no one knows how things work, the system is unstable, and you can't even go back and patch the software anymore, it's that old. Or you could be running into unexpected side effects of a code change, or an incomplete software deploy that went out to a system incorrectly. In these situations you can end up firefighting: you get one issue fixed, another one arises, you jump to that one, then another, and it's a repeating pattern you don't want to get into. Yes, in some situations you're stuck there. However, you still want to step back at the end of every hot fix or bug fix and ask: do we have an understanding of what is causing these issues? Did we do our RCA? If we did our RCAs correctly, we should start to see a pattern and be able to identify what's causing this critical failure or this firefighting mode, and how we get out of it. What is our path out? We need our resources freed up so that we can continue doing normal business functions, get our software built correctly, and keep things moving out the door.
So, just to quickly recap. We talked about the overall process of production support. We talked about coming in at the customer support level, where the customer support rep talks to the customer and identifies what the issue is, what type of bugs or problems they're having. They get the right people involved and troubleshoot what type of problem it is. Is it customer-facing? Is it internal, in the back end? Is it a configuration issue? Is it simply that the user forgot their password? We go through a troubleshooting process of walking through it internally, trying to reproduce the bug and identify what's going on. Again: don't panic. Take that breath, don't take it personally, keep it all on the level, and keep everyone informed.

We also talked about identifying the different types of risks, the risk analysis and validation, and the particular problems you can run into. We talked about the different types of fixes we could do. There's the hotfix, fixing it in real time, right now; and normal bug fixing during our software development life cycle, testing our code, finding bugs, fixing them, and preventing those bugs from going out in a release. We talked about maintenance windows, where we take the systems down for scheduled maintenance to do system patching, updates, system reboots, things of that nature. And then we talked about patching, where instead of doing a hot fix we go through a very small release cycle: we fix a combination of bugs, get them tested, and roll that out in a timely manner, where we actually test it rather than just quickly reacting to the customer complaint, to the fire.

Then we talked about prevention: doing the root cause analysis, identifying what caused the problem and why it happened, and what we can do to prevent it, such as training, better testing, things of that nature. The last thing we talked about was avoiding firefighting. Try not to get into that mode of always being reactive. We don't want to be always doing hot fixes, just reacting to a customer's complaint. We want to be mindful and think through the process, so that we make sure we're not adding additional bugs into the system and we're testing our software thoroughly. With that, I will open it up for questions.
Mike, if I had a production issue, is that the scariest part for a software developer when it comes to production? Or is it the deadline when you're going to go live? What scares you when you roll software to production?
One of the biggest things I personally deal with in going to production, given the fields I've been in, is health care and PHI, patient health information, or financial information. Yes, deadlines are scary, but deadlines can typically come and go; they're an arbitrary number. In some cases they're not, like when you're dealing with government health care laws. One of the biggest constant concerns with any type of software development is data integrity: making sure that data that should not be exposed is not being exposed and is being encrypted correctly. I've been in a couple of situations where PHI potentially could have been leaked. Thankfully it wasn't, and we caught it in time. But that's the biggest concern: when a release goes out and something critically fails, as a developer you always fear that you missed something, that something wasn't done right. Typically, larger software organizations have larger QA departments and a lot more testing involved, but when you're dealing with smaller startups or just smaller software shops, you don't have that larger QA, so you don't necessarily have the time to do a full regression of your software. When you make a code change, you don't know; there is that risk, especially if you have to make a hot fix. If there is a production issue, something's down, you fix it, and you always have that nagging concern or fear: did I break something else? Was there something else we missed? Why did it go down? That's why you want to do things like the root cause analysis. I don't know, from your perspective, Rob, what you would run into, but that's my perspective.
Yeah, among those, the scariest is probably production issues, particularly with healthcare or anywhere you've got something critical. If it's an Angry Birds-style app, or just people playing games, or something for fun, that's one thing. Even something like a radio app, where you're listening to a radio show or TV show, isn't necessarily critical. But when you're talking about something that is either the lifeblood of the business, where if it's not running they're not generating revenue, or somebody's literal lifeblood, where it's health and somebody could get sick or die because of some bug, those obviously are very scary.

Outside of that, I think the scariest, or most annoying, ones in general, and this sort of goes to Michael's topic, are the ones that keep showing up. You have something and it's fixed, whether you fixed it, or you think it's fixed, or somebody else on the team fixed it, and then essentially the same bug appears again. It shakes your trust in whether it's fixed the next time around. This gets back to some of those steps Michael mentioned, where you really say: okay, let's step back, let's look at the root cause, what really caused this, and let's make sure that when we fix it this time, we actually fix it. Sometimes you look at it and say, oh, this is the reason this bug occurred, so we'll just change this one thing, we'll be off and running, and we're fine. But if you don't look into it enough, what you end up doing is more like putting a band-aid on it: you're correcting for the result, not fixing the actual bug.

That can be really nasty, because in some cases you end up building a layer of infrastructure that corrects for this bug in multiple places. Then you find out after the fact, after you've put in several fixes and made these changes, that those things weren't really needed; there was another approach you were supposed to take. So now you've figured it out and fixed it properly, but you have to go back and find all of those places where you made adjustments to compensate for it, and basically roll those back. What you were doing was a one-off kind of fix each time, even if you didn't know it, and now you've got to go track them all down.
This also goes to the whole idea of version control and trying to keep good comments (which is very rare), or at least a ticket number (which is more common), recorded with each change: what the change was and what it was for. In this case, if I had 15 tickets related to this bug in the past, and now I go back, fix it properly, and realize I need to undo those changes, I can go back to those 15 tickets, look at what was changed, look at those code commits, figure out what was changed, and roll it back. It's not fun, it's not pretty, and it's probably going to be tedious and time-consuming, but at least I can be thorough. I can actually look at it and say: yes, these are all the places I changed it, and this is where I need to roll them back.
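Tracking down every change tied to a ticket, as described above, usually starts with searching the commit history for the ticket number. A minimal sketch over an in-memory commit log; the log lines and ticket format are invented for illustration, and in practice you would run `git log --grep` (or your forge's search) against the real repository:

```python
import re

# One line per commit: "<hash> <message>", as `git log --oneline` would show.
commit_log = [
    "a1b2c3d BUG-417 correct for truncated totals in report",
    "d4e5f6a add export button to dashboard",
    "b7c8d9e BUG-417 work around truncated totals in invoice view",
    "e0f1a2b BUG-417 band-aid for truncated totals in email summary",
]

def commits_for_ticket(log: list[str], ticket: str) -> list[str]:
    """Return the hashes of commits whose message mentions the ticket."""
    pattern = re.compile(rf"\b{re.escape(ticket)}\b")
    return [line.split()[0] for line in log if pattern.search(line)]

print(commits_for_ticket(commit_log, "BUG-417"))
# ['a1b2c3d', 'b7c8d9e', 'e0f1a2b']
```

This is exactly why the ticket-number-in-the-commit-message discipline pays off: the rollback list builds itself instead of depending on anyone's memory.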
I did want to mention that the thing I like that you said early on, Michael, is "don't panic." Obviously the Douglas Adams reference jumps to people's minds, but it really is, I think, the cause of stumbling into firefighting more often than not. We found a bug, we want to fix it immediately, and it's almost like a panic, a rush to get it done. As you mentioned, we don't step back, take a breath, and ask: okay, what really happened? What do we really need to do to fix it? I think if you can keep a cool head, that is a huge first step in dealing with these things properly, so that you assess the problem, figure out a solution, fix it once, and you're done, as opposed to it recurring over and over.
And on that note, one thing to mention: if you get into a firefighting mode and you're constantly doing hotfixes, your code is eventually going to become spaghetti code, because you're not taking the time to go back, review things correctly, and go through the normal software process. You run the risk of your code reaching an unmaintainable or unstable place just from that continuous hotfix mode.

That's an excellent point, and it's not only the code itself that becomes unmaintainable, with the spaghetti of all these little patches and little fixes. Because you're doing this quickly, in that firefighting mode, you're shortcutting some of the processes: your documentation is not up to date, your comments aren't up to date. There are a lot of cases where you see a really nice first version of the code, then a bunch of fixes came through, and the whole thing is a mess. You can see where at some point it looked like it was done right: properly thought out, documented, all the t's crossed and the i's dotted. And then you find these jumbles of code, like, why is this here? You'll find magic numbers and all kinds of other bad things showing up in the code. It started off great, and then, because you didn't follow through and keep your head, you essentially ruined a good thing. Excellent point.
I would like to thank everyone for your time today. We appreciate it, and if you'd like to discuss any of our topics further, you can reach out to us at info@develpreneur.com, or through the contact page on our website, develpreneur.com. We're also on Twitter at @develpreneur, and on Facebook, which we check occasionally, at facebook.com/develpreneur. You can find our videos on Vimeo, and we now have a new YouTube channel: search for Develpreneur on YouTube and you'll find our videos there. Our goal is making every developer better. Have a wonderful day.
[Music]
you
Transcript Segments
0.43

[Music]

27.68

now once we have fixed the problem and

30.24

we've identified all the particular

32.079

issues that were out there we need to go

34.48

back and do what is called a root cause

37.2

analysis or rca

39.36

the primary purpose of the rca is to

42.32

analyze a problem or sequence of events

44.879

in order to identify what happened why

47.44

it happened and what can be done to

48.879

prevent it from happening again

51.52

so basically this is that after you do

53.84

the whole software development cycle you

55.76

go back and you do that review process

58.239

you go back in and review what went well

61.199

what didn't go well but here for a root

64.08

cause analysis we need to look at this

66.64

from a end user's perspective

69.28

and a risk

71.84

so we want to make sure that we've

73.28

defined the problem correctly that we

75.52

gathered all the information that came

77.759

out of not only identifying the

80

particular problem how we can reproduce

82.08

the problem and what we did to fix the

84.24

problem so we get all that information

87.28

we identified the causal factors

89.92

these are the things that actually

91.439

triggered what happened

93.759

so now we have our data in hand

96.079

we want to look at all the different

97.759

possibilities that led us to this

99.84

problem was it an alphanumeric field

102.32

that really should have been alpha only

104.479

or numeric only did we have bad data

107.68

did we have bad security practices

110.96

were we missing logins were missing

113.68

role-based security

116.079

once we have identified the factors we

118.88

can then determine the root cause or

121.119

cause is

122.24

and here we use some of the root cause

124.56

analysis tools that are on the market

127.439

we can also go in and just kind of dig

130.08

through what happened what did we do to

132.72

actually fix the problem

134.64

so this could determine the root cause

136.72

is a combination of what the developer

139.12

or networking teams or the different

141.04

groups went in to figure out we take how

143.76

we figure out we have the bug but then

145.52

what did we do to fix it

147.92

now we found the bug

149.44

how did we fix it and then the last one

152.239

is okay now that we fixed it now that we

154.8

have things stable again what can we do

157.599

to prevent this from happening again

160.239

what is the recommended and implemented

162.239

solutions going forward that we can put

164.239

in place to stop the bleeding

166.959

make sure this doesn't happen again

169.36

so there are different things you could

170.879

do this could also be training

173.519

this could also be implement additional

176

tools add additional logging and alerts

178.959

to your system add system monitoring

181.519

those different agents you can put on

183.12

your pcs

184.48

these are just some of the things that

185.84

you can do at the end for the root cross

188.72

analysis to make sure that everything is

191.68

cohesive that you have identified what's

194

going on and what happened and how you

196.239

can prevent this from happening again

198.959

people are going to be asking why this

200.959

happened

201.92

now as a developer or depend upon where

204.879

you're at in the company

206.56

if you are constantly asked why

208.4

something broke you're going to start

210.4

getting defensive and with the root

212.56

cause analysis you have to try to

214.959

detract yourself from that don't take

217.36

this personally if you make a mistake

220

own up to it say hey i made a mistake in

222.72

this process or the code had this

224.959

particular problem in it here's what

227.2

happened

228.08

this is not a blame game

230.72

this is basically a cause and effect

233.28

what caused the issue

234.959

what was the effect and how do we

236.56

prevent it from happening again

238.959

when you're doing a root cause analysis

241.12

try to take the emotions out of it i

243.28

know it's hard but the rcas are very

246.64

critical to ensure that you do not have

249.12

problems in the future and the only way

251.599

to do that is to really be honest when

253.92

you do your research here and you

256

actually go through and do the analysis

258

and identify what happened

260.239

finally before we leave this page the

263.04

last thing i recommend from an rca is

266.56

even though you've done the rca even

268.32

though we've identified all these things

270.32

the recommended and implemented

271.84

solutions are identified

273.84

i also recommend you implement some type

276.479

of training practice

278.88

training could be a part of the

280.639

recommendation but if you have a

282.32

production support customer support it's

284.72

still good to do either monthly

287.36

trainings depending upon your release

289.44

cycles or at least once a sprint or once

292.56

a release cycle you definitely get your

294.96

support teams involved so that they get

297.04

the training on what's coming

299.199

what are some of the issues that have

300.8

been found and identified if you're on

303.199

the support side of things i definitely

305.36

recommend that you do at least

307.84

monthly if not quarterly review of all

310.24

your rcas to make sure that you don't

313.199

see a pattern do we see a bad pattern in

316.24

our software are things breaking what is

318.96

going on

320.8

and that can also help you identify root

323.52

causes in organizational problems

326

strategies release cycle problems

328.639

development problems so rcas give us a

331.84

lot of information and from that you

334.479

want to do add additional things like

336.24

training

337.919

lastly i want to leave you with avoid

340.639

firefighting

341.919

so when we get into this routine of

344.72

software support and production support

347.199

customer support we get through fixing

350

problems

351.199

we potentially run the risk of getting

353.919

into a firefighting mode basically every

356.88

time a customer calls has a potential

359.199

problem we go out and we quickly fix it

362.88

so all we're doing is hot fixes

365.28

we're not going through a normal

366.319

software process

367.84

so in a sense firefighting is just it

371.039

says here so firefighting is just like a

373.6

real world where the assumption that a

376.08

fire or where fires are unpredictable

378.56

and that they may be dealt with

379.919

immediately or have to be dealt with

381.6

immediately

383.12

again hot fixes

385.12

if this tool is used too frequently we

388.4

end up in like an emergency action mode

390.72

all the time and this could lead to poor

392.96

planning lack or organizational support

396.24

and is likely to tie up all of our

398.08

resources that could be used elsewhere

400.8

now to keep firefighting to a minimum a

403.199

comprehensive disaster plan might be

406

implemented

407.6

we could think about if something

409.599

critical happens how do we address that

412.16

as a company do we have things like

414.24

failover if the system goes down are we

416.8

in a load bouncer

418.319

if one machine goes down we just take it

420.479

out of the cycle and put another one in

423.36

so are things that can be done to help

425.12

prevent the firefighting model

428.08

however i've actually been in this

430.16

situation currently where firefighting

432.72

unfortunately can be a necessary mode to

436.16

go into for a while when you're dealing

438.56

with some critical issues

440.8

for instance if you're dealing with a

442.479

legacy application that is so old you

445.36

have no one left in the organization to

447.44

support it

448.56

you have very high technical debt

451.28

no one knows how things work

453.919

uh system is unstable you can't even go

456.4

back and patch the software anymore

458.639

it's that old or you could be running

461.68

into unexpected side effects of a code

464.24

change or an incomplete software deploy

466.72

that went out to a system incorrectly

470.319

in these situations you potentially

472.479

could end up firefighting

474.4

where you're just constantly you get one

476.72

issue fixed another one rises you jump

478.96

to that one another one and it's just

481.52

repeating pattern that you don't want to

483.44

get into

484.479

yes in some situations you're stuck here

488.319

however you still want to step back at

491.44

the end of every hot fix or the end of

494.319

every bug fix

495.759

and say hey do we have an understanding

498.879

of what is causing these issues

501.44

did we do our rca if we did our rca

504.72

correctly we should start to see a

506.879

pattern and be able to identify what's

509.28

causing this critical failure or these

511.52

firefighting mode

513.2

and how do we get out of it what is our

515.68

path to get out of it because we need to

517.919

make sure we have our resources freed up

519.839

so that we can continue doing normal

521.76

business functionality get our software

524

built correctly and keep things moving

525.839

out the door

527.36

so just to quickly recap we talked about

530.08

the overall process of production

532.08

support we talked about coming in at the

534.56

customer support level where the

536.8

customer support rep talks to the

538.64

customer identifies what the issue is

541.6

with the customer what type of bugs or

543.519

problems they're having

545.04

they get the people involved

547.44

they troubleshoot the type of problem it

549.68

is

550.959

is it customer-facing

552.64

is it internal back-end

554.8

is it a configuration issue

557.04

Is it simply that the user forgot their password? We go through a troubleshooting process of walking through it internally, trying to reproduce the bug and identify what's going on. Again, don't panic. Take that breath, don't take it personally, keep it all on the level, and keep everyone informed.

We also talked about identifying the different types of risks, the risk analysis and validation, and the particular problems you can run into. We also talked about the different types of fixes we could do. We talked about the hotfix, fixing it in real time, right now; doing normal bug fixing during our software development life cycle, testing our code, finding bugs, fixing them, and preventing those bugs from going out in a release. We talked about maintenance windows, where we take the systems down for scheduled maintenance and do system patching, updates, system reboots, things of that nature. Then we talked about patching, where instead of doing a hotfix we go through a very small release cycle: we fix a combination of bugs, get them tested, and then roll that out in a timely manner, where we actually test it rather than just quickly reacting to the customer complaint, to the fire.

Then we talked about prevention: doing the root cause analysis, identifying what caused the problem, why it happened, and what we can do to prevent it, whether that is training, better testing, or things of that nature. The last thing we talked about was avoiding firefighting. Try not to get into that mode of always being reactive. We don't want to be in the mode of always doing hotfixes, just reacting to a customer's complaint. We want to be mindful and think through the process, so that we make sure we're not adding additional bugs into the system and we're testing our software thoroughly. With that, I will open up for questions.

Mike, I had a production issue. Is that the scariest part for a software developer when it comes to production, or is it the deadline when you're going to go live? What scares you when you roll software to production?

One of the biggest things I personally deal with in going to production is that the fields I've been in are health care, with PHI (patient health information), and financial information. Typically, yes, deadlines are scary, but deadlines can come and go; in some cases they're an arbitrary number. Sometimes they're not, such as when you're dealing with government health care laws. One of the biggest constant concerns with any type of software development is data integrity: making sure that data which should not be exposed is not being exposed, and that it is being encrypted correctly. I've been in a couple of situations where PHI potentially could have been leaked. Thankfully it wasn't, and we caught it in time. But that's the biggest concern: when a release goes out and something critically fails, you always, as a developer, fear that you missed something, that something wasn't done right. Typically, in larger software organizations you have larger QA departments and there's a lot more testing involved, but when you're dealing with smaller startups or just smaller software shops, you don't have that larger QA, so you don't necessarily have the time to do a full regression of your software. So when you make a code change, you don't know; there is that risk, especially if you have to make a hotfix. If there is a production issue, something's down, you fix it, and you always have that nagging concern or fear: did I break something else? Was there something else that we missed? Why did it go down? That's why you want to do things like that root cause analysis. I don't know, from your perspective, Rob, what you would run into, but that's my perspective.

Yeah, among those, the scariest thing probably is production issues, particularly with healthcare or anywhere you've got something critical. If it's an Angry Birds app or something like that, just people playing games for fun, or even something like a radio app where you're listening to a radio show or a TV show, it's not necessarily critical. But when you talk about something that is either the lifeblood of the business, where if it's not running they're not generating revenue, or somebody's literal lifeblood, where it's health and somebody could get sick or could die because of some bug, those obviously are very scary.

Outside of that, I think the ones that are the scariest, or most annoying, and this sort of goes to Michael's topic, are the ones that keep showing up: you have something, and it's fixed, whether you think it's fixed or somebody else on the team has fixed it, and then essentially the same bug appears again. It shakes your trust in whether it's fixed the next time around. This gets to some of those steps Michael mentioned, where you really say, okay, let's step back, let's look at this root cause: what really causes it? Let's make sure that when we fix it this time, we can actually fix it. Sometimes it is very much a case where you look at it and say, oh, this is the reason the bug occurred, so we'll just change this thing and we'll be off and running, we're fine. But what you end up doing is more like putting a band-aid on it; you're not fixing it if you don't look into it enough. Sometimes you're correcting for the result, and you're not fixing the actual bug.

That can be really nasty, because in some cases you end up building a layer of infrastructure that corrects for this bug in multiple situations. Then you find out after the fact, after you've put in several fixes and made these changes, that those things really weren't needed; there was another approach you were supposed to take. So now you're in the situation where you have your fix, you figured it out, you fixed it, but then you have to go back, find all of those places where you made adjustments to compensate for the bug, and basically roll those back. What you were doing was a one-off kind of fix for those symptoms, even if you didn't know it, and now you've got to go track them all back down.

That also goes to the whole idea of version control and trying to keep good commit comments (which is very rare), or at least some sort of ticket number (which is more common), recording what each change was and what it was for. So if I had 15 tickets related to this bug in the past, and now I go back, fix it properly, and realize I need to undo those changes, I can go back to those 15 tickets, look at what was changed, look at those code commits, figure out what was changed, and roll it back. It's not fun, it's not pretty, it's probably going to be tedious and time-consuming, but at least I can be thorough. I can actually look at it and say, yes, these are all the places I changed it, and this is where I need to roll them back.
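The ticket-driven rollback described above can be sketched with git's history search. This is a minimal illustration, assuming git is installed; the ticket ID `PROJ-123`, file names, and commit messages are all made up for the demo. It builds a throwaway repository, tags a "band-aid" commit with the ticket number, then finds and reverts every commit referencing that ticket without rewriting history:

```python
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd):
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(
        ["git", *args], cwd=cwd, check=True,
        capture_output=True, text=True
    ).stdout

# Throwaway repo with an initial release commit.
repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "demo@example.com", cwd=repo)
git("config", "user.name", "Demo", cwd=repo)
Path(repo, "app.txt").write_text("v1\n")
git("add", ".", cwd=repo)
git("commit", "-q", "-m", "initial release", cwd=repo)

# A hurried workaround, committed with a (hypothetical) ticket ID.
Path(repo, "fix.txt").write_text("workaround\n")
git("add", ".", cwd=repo)
git("commit", "-q", "-m", "PROJ-123: band-aid for login bug", cwd=repo)

# Find every commit whose message references the ticket...
hashes = git("log", "--format=%H", "--grep=PROJ-123", cwd=repo).split()

# ...and revert each one, preserving history.
for h in hashes:
    git("revert", "--no-edit", h, cwd=repo)

print(Path(repo, "fix.txt").exists())  # -> False: the workaround is rolled back
```

The key habit this relies on is exactly what the discussion recommends: putting the ticket number in every commit message, so `git log --grep` can later recover the full set of related changes.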

I did want to mention the thing I liked that you said early on, Michael: don't panic. Obviously the Douglas Adams line jumps to people's minds, but it really is, I think, the cause of stumbling into firefighting more often than not. We found a bug, we want to fix it immediately, and it's almost like a panic, a rush to get it done. As you mentioned, we don't step back, take a breath, and ask: okay, what really happened? What do we really need to do to fix it? I think keeping a cool head is a huge first step in dealing with these things properly, so that you assess the problem, figure out a solution, fix it once, and you're done, as opposed to it recurring over and over.

And on that, one of the things to mention is: if you get into a firefighting mode and you're constantly doing hotfixes, your code is eventually going to become spaghetti code, because you're not taking the time to go back, review things correctly, and go through the normal software process. You also run the risk of your code reaching an unmaintainable or unstable place just from that continuous hotfix mode.

That's an excellent point, and it does make the code itself unmaintainable at times, because you end up with almost a spaghetti of all these little patches and little changes. But also, because you're doing this quickly, in that whole firefighting mode, you're shortcutting some of the processes: your documentation is not up to date, your comments aren't up to date. There are a lot of cases where you see this really nice first version of the code, then a bunch of fixes, and it just makes the whole thing a mess. You can see where, at some point, it looked like it was done right, properly thought out and documented, all the t's crossed and the i's dotted, and then you find these jumbles of code, where you wonder, why is this here? Or you'll find magic numbers and all kinds of other bad things showing up in the code. You started off great, and then, because you didn't follow through and keep your head, you essentially ruined a good thing.

Excellent point.
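To make the "magic numbers" complaint concrete, here is a small, hypothetical illustration (the lockout rule and names are invented for the example). The hurried hotfix bakes an unexplained literal into the logic; the maintainable version gives the same value a name and a reason to exist:

```python
# Hotfix style: a "magic number" buried in the logic.
def is_locked_out_hotfix(failed_attempts):
    return failed_attempts >= 5  # why 5? a year later, nobody remembers

# Maintainable style: the same rule with the constant named and explained.
MAX_FAILED_LOGIN_ATTEMPTS = 5  # policy: lock the account after 5 bad tries

def is_locked_out(failed_attempts):
    return failed_attempts >= MAX_FAILED_LOGIN_ATTEMPTS

print(is_locked_out(3), is_locked_out(5))  # -> False True
```

The behavior is identical; the difference is that the second version survives the next firefight, because the next person can see what the number means before changing it.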

I would like to thank everyone for your time today. We appreciate it. If you'd like to discuss any of our topics further, you can reach out to us at info@develpreneur.com, or contact us through our website, develpreneur.com. We're also on Twitter at @develpreneur, and on Facebook, which we check occasionally, at facebook.com/develpreneur. We're on Vimeo, where you can find our videos, and we now have a new YouTube channel; search for Develpreneur on YouTube and you'll find our videos there. Our goal is making every developer better. Have a wonderful day.
