Building Better Developers

Detailed Notes

After our series of building a python application to synchronize databases at the schema and data level we have lessons learned. There are areas where we need to be intentional as we decide how to maintain and enhance our databases. This is part 2.

Learn More at https://develpreneur.com

Transcript Text

thank you
[Music]
so we get to indexes and keys which are
sort of a they they feel a little bit
like columns because typically what you
have is you've got like sort of like
with store procedures you have a sort of
a name of a index or key which is a lot
of cases secondary because really it's
the key the important things is what
owns it you know like let's say it's uh
what table is this index on and then
what column or columns yes basically
that signature is really what matters
the name is really not significant for
the most part other than if you're
trying to compare because sometimes you
you know you'll see the way my sequel
works it
you'll see that a column has an index on
it but you have to go digging around
otherwares otherwise other places to
figure out what that index is so you
need and base it ends up being sort of a
key you have to go find the name and you
use that name and then you can figure
out what's the table that it's on what's
the column or columns that it that it's
um this or part of that index and then
things like is this a is it a primary
key is it a unique index
um I think there's cluster there's a
couple others that I didn't even stumble
into yet as I was going through this but
there are different flavors of indexes
and then with those which is sort of a
precursor talking about data is that
there can be uh data Integrity issues
and particularly if you're changing
indexes and foreign keys
you can run into the situation where
it's awesome and your source and your
production data that you're now
replicating down but because of the data
that exists in your target
if you change that index or that key or
add that foreign key it doesn't allow
you to because the data
is has not does not conform to that uh
to that key or that index so for example
if you try to create a unique index on a
column that does not have unique values
it will fail it will say you can't do
that because you don't have unique
values currently and if I try to do this
it will break
so you sort of get it's almost a chicken
and egg kind of situation because what
you need to do is get that
get the data essentially corrected
changed altered first
and then deal with your indexes and keys
but
you sort of need the indexes and keys
there sometimes to help you figure out
what data you're going to move from your
source to your destination
and this is where we get to data issues
are really
through the whole thing are the toughest
things to Think Through
is while the structure is key and it's
actually it would not be it's not
terribly difficult to say if I don't
touch the data and all I want to do
is either synchronize you know it's
basically synchronize everything that I
can and then have a list of changes that
need to be done where I can't which
would be for example
uh I want to create I have a unique
index in production I want to create
that unique index in my target but the
data doesn't allow it have some sort of
like an exception log or something that
just says hey you want to create this or
this is a place this is something that
is not synchronized I can't synchronize
it you know automated because the data
has to change in your development
environment but then you would know just
say oh I'm going to go in I'm going to
tweak my data or delete the rows that
are duplicates or whatever it is
however I addressed that data issue and
then I can run it through and
synchronize
but the key again the challenge always
comes down to the data because you have
to worry about things like
um and this is just you know it's just
the start of it but if you let's assume
that you are okay moving all of your
data or a portion of your data from
production to your development machine
which is much more common you know it's
really
I think a lot of us if we had a tool
that said
I want to be able to take
a hundred random rows out of production
pull those into my development
environment and have it bring those rows
plus all of the related data
so that I have effectively production
but a 100 row version of production then
it's like perfect for development
testing and things like that because now
I have I have actual production data
sitting in my development machine but I
only have a few rows of it
so it's not near the size
and I can do a lot of testing with it
now you know performance testing is not
going to be the same obviously but like
data and structure and design and coding
related stuff I can get all that
so one of the things to think about or
that's like the common data issues it's
just deleted records and in particular
um
it's not really General data and I'm
going to sort of classify stuff a little
bit here General data is not as much an
issue so if like if you have a customer
that exists in production and they don't
exist in your development
that's okay or if they I'm sorry if it
exists in development and it doesn't
exist in production that's okay however
if you have a lookup of some sort so if
you have
um let's say you have like four
different address types in production
and or let's say you had five so you had
I don't know shipping home billing uh
mailing and something office and you say
hey we're not going to use Office
anymore so now it's just going to be
billing shipping home
and whatever the other one was I used so
you go from five to four well those
lookup values now in your development is
going to be it can throw stuff off
you're giving a c stuff differently if
you think of like a web app like a crud
application that has drop downs that
utilize those lookups then suddenly you
know you're going to see options in
development that don't exist in
production and that sometimes can be an
issue because it may be that's you know
that's where the bug resides that you're
trying to track down or there's a
feature that you're working with that
changes because you have a different
number of uh Records that show versus
what would always show in production
um sort of late it is
probably one of the hardest things is
whether you care about
um and that's particularly if there's
what's you know like audit tables and
things like that whether you want to
bring down audit tables because those
can get huge and typically are not
terribly useful uh yeah the audit trails
that you have in production you may or
may not need in your development
environment uh your IDs are gonna
probably need to match this is a it's
probably the classic of a data database
replication is if my record one two
three four five in production
is copied into development then I
probably need it to be id12345 in
development to make sure that all of the
other things that use that ID are
properly you know Linked UP
and so lookup values and uh duplicates
are things you want to worry about
because you don't particularly like if
I've you know have my five different
uh mailing types and I synchronize down
to my development I want to be able to
say okay those four rows that I have in
development
um the ones that it wants to bring in
from production they already exist
so you know you can do it the simple way
and say the first thing you do is you
just delete all each truncate delete all
the data from all your tables and
development and you don't have to worry
about duplicates you're just going to
just you know pull data in from
production
but maybe you don't want to do that you
know maybe it's going to say or maybe
there's something that blocks that as
far as
um
referential integrity and things like
that is concerned so you have to worry
about the idea of are the four rows that
I have in development that are in my
destination the same as the four rows
that I have in my Target and if not I
need to be able to handle that you know
be able to basically properly edit
change delete update records
then that can be tricky because
sometimes you have to maintain IDs and
do on it do it on a row specific ID by
ID basis as opposed to thinking as a
table of values
and it comes one of the big questions
all or some of the data
and it goes back to if you've got a 10
terabyte production database
you're not going to want to bring 10
terabytes of data down to your
development machine here especially if
you've got a little development laptop
you're gonna so you know maybe you only
want to do some of the data
in which case
it's not as simple as saying okay just
give me 10 rows out of each table
because
the 10 rows you have have to be able to
link to data in the other tables if
they're you know you have to maintain
referential Integrity with those rows
that you bring across
data can be really challenging because
it's do you care if there's been edits
to data or not do you need to do a
detailed compare of your data rows that
says okay for each column is the value
identical if not update it
and then even then you can get into
stuff like some time stamps and things
like that it's like do I really want to
do it do I not and so those are all
considerations to Think Through
and of course it gets back to uh space
and size you know do I
if I do it in a way that is complete is
it still going to be essentially too big
for what I need and so there can be some
you know there's those kinds of
considerations you have to walk through
now all of these things add up to
um a very complicated
application
uh and the way that I went through it in
this is basically walking through each
of these items if you go all the way
back to the front each of these items in
sort of uh trying to take them by
themselves
and that's what we end up with is that
we've got
um we've got these little basically
these little functions
that deal with tables that deal with the
differences in tables that deal with the
differences The Columns within a table
that deals with the store procedures and
where their differences in store
procedures and with functions because
store Precision functions are a little
different and then indexes and foreign
key relationships and we actually go
into triggers as well which are I guess
that sort of fall into the same as store
procedures and functions and in most
those cases what we end up with is that
we are
pull this up
I don't have that one open right now
so if we go into
we end up with this thing that is
uh let's see yeah
we're basically we go in and we're we're
figuring out each of these little
problems by itself so we need to figure
out are there missing tables or the
tables are missing in in the destination
from The Source uh we need to compare it
where we just walk through and say okay
you know are the tables the same and if
not we have to figure out what we do are
there procedures missing are there
functions missing are there triggers
that are missing and we're just
generating in a lot of these cases we're
going through and we're
getting all the information from our
source so for like well it triggers
Paradise good but let's say for uh
functions get really nasty actually set
a parameter but
store procedure functions it's basically
we go through and we say hey
I want to figure out in my source and
then in my destination what are all of
the store procedures for example that I
have and then we're gonna have to do is
wherever it doesn't exist I have to
actually go dig up
the definition which luckily you know
databases store that is the definition
of it in the source and then from that
definition and its parameters
I need to be able to then I'm going to
go through and basically I'm generating
SQL on the Fly and then executing that
thank you

Transcript Segments

10.7

thank you

18.89

[Music]

29.279

so we get to indexes and keys which are

33.54

sort of a they they feel a little bit

37.14

like columns because typically what you

39.899

have is you've got like sort of like

41.64

with store procedures you have a sort of

43.739

a name of a index or key which is a lot

47.64

of cases secondary because really it's

50.1

the key the important things is what

52.32

owns it you know like let's say it's uh

54.6

what table is this index on and then

57.719

what column or columns yes basically

60.48

that signature is really what matters

63.059

the name is really not significant for

67.74

the most part other than if you're

70.26

trying to compare because sometimes you

73.38

you know you'll see the way my sequel

76.56

works it

77.76

you'll see that a column has an index on

80.52

it but you have to go digging around

82.32

otherwares otherwise other places to

85.2

figure out what that index is so you

87.84

need and base it ends up being sort of a

89.7

key you have to go find the name and you

91.5

use that name and then you can figure

out what's the table that it's on what's

94.619

the column or columns that it that it's

97.92

um this or part of that index and then

100.799

things like is this a is it a primary

103.14

key is it a unique index

107.04

um I think there's cluster there's a

108.72

couple others that I didn't even stumble

110.28

into yet as I was going through this but

112.14

there are different flavors of indexes

116.88

and then with those which is sort of a

120.42

precursor talking about data is that

121.92

there can be uh data Integrity issues

124.68

and particularly if you're changing

127.039

indexes and foreign keys

130.619

you can run into the situation where

132.78

it's awesome and your source and your

135.9

production data that you're now

137.28

replicating down but because of the data

140.879

that exists in your target

144.959

if you change that index or that key or

148.14

add that foreign key it doesn't allow

151.44

you to because the data

153.84

is has not does not conform to that uh

158.879

to that key or that index so for example

162.18

if you try to create a unique index on a

165.3

column that does not have unique values

167.519

it will fail it will say you can't do

169.68

that because you don't have unique

171.48

values currently and if I try to do this

173.58

it will break

174.72

so you sort of get it's almost a chicken

177.48

and egg kind of situation because what

180.54

you need to do is get that

182.9

get the data essentially corrected

185.879

changed altered first

188.519

and then deal with your indexes and keys

191.4

but

192.959

you sort of need the indexes and keys

194.819

there sometimes to help you figure out

196.94

what data you're going to move from your

200.28

source to your destination

204.48

and this is where we get to data issues

207.44

are really

209.4

through the whole thing are the toughest

212.22

things to Think Through

215.159

is while the structure is key and it's

219.18

actually it would not be it's not

221.7

terribly difficult to say if I don't

224.28

touch the data and all I want to do

228.019

is either synchronize you know it's

231.36

basically synchronize everything that I

232.62

can and then have a list of changes that

235.44

need to be done where I can't which

237.36

would be for example

239.879

uh I want to create I have a unique

241.739

index in production I want to create

244.379

that unique index in my target but the

247.5

data doesn't allow it have some sort of

249.78

like an exception log or something that

251.22

just says hey you want to create this or

253.739

this is a place this is something that

255.72

is not synchronized I can't synchronize

257.94

it you know automated because the data

259.919

has to change in your development

261.78

environment but then you would know just

263.58

say oh I'm going to go in I'm going to

265.62

tweak my data or delete the rows that

267.3

are duplicates or whatever it is

269.4

however I addressed that data issue and

272.759

then I can run it through and

274.08

synchronize

276.06

but the key again the challenge always

278.759

comes down to the data because you have

280.259

to worry about things like

283.08

um and this is just you know it's just

285.18

the start of it but if you let's assume

287.88

that you are okay moving all of your

290.22

data or a portion of your data from

293.34

production to your development machine

295.86

which is much more common you know it's

298.32

really

299.28

I think a lot of us if we had a tool

301.86

that said

303.479

I want to be able to take

306.66

a hundred random rows out of production

310.259

pull those into my development

311.639

environment and have it bring those rows

313.56

plus all of the related data

316.199

so that I have effectively production

319.02

but a 100 row version of production then

323.1

it's like perfect for development

324.539

testing and things like that because now

326.34

I have I have actual production data

331.44

sitting in my development machine but I

333.6

only have a few rows of it

335.699

so it's not near the size

337.86

and I can do a lot of testing with it

340.5

now you know performance testing is not

342.6

going to be the same obviously but like

345.539

data and structure and design and coding

348.419

related stuff I can get all that

351.6

so one of the things to think about or

353.22

that's like the common data issues it's

354.9

just deleted records and in particular

358.8

359.58

it's not really General data and I'm

364.44

going to sort of classify stuff a little

365.88

bit here General data is not as much an

367.68

issue so if like if you have a customer

369.06

that exists in production and they don't

370.62

exist in your development

372.3

that's okay or if they I'm sorry if it

374.22

exists in development and it doesn't

375.479

exist in production that's okay however

379.259

if you have a lookup of some sort so if

381.72

you have

383.58

um let's say you have like four

384.9

different address types in production

388.319

and or let's say you had five so you had

390.539

I don't know shipping home billing uh

394.62

mailing and something office and you say

399.84

hey we're not going to use Office

400.8

anymore so now it's just going to be

402.06

billing shipping home

403.86

and whatever the other one was I used so

406.08

you go from five to four well those

408.06

lookup values now in your development is

410.88

going to be it can throw stuff off

413.4

you're giving a c stuff differently if

415.319

you think of like a web app like a crud

417.18

application that has drop downs that

419.58

utilize those lookups then suddenly you

422.94

know you're going to see options in

424.68

development that don't exist in

426.06

production and that sometimes can be an

428.4

issue because it may be that's you know

431.22

that's where the bug resides that you're

432.72

trying to track down or there's a

433.979

feature that you're working with that

436.139

changes because you have a different

437.819

number of uh Records that show versus

441.18

what would always show in production

445.259

um sort of late it is

447.419

probably one of the hardest things is

449.4

whether you care about

452.28

um and that's particularly if there's

453.9

what's you know like audit tables and

455.46

things like that whether you want to

457.02

bring down audit tables because those

458.4

can get huge and typically are not

460.919

terribly useful uh yeah the audit trails

464.16

that you have in production you may or

465.479

may not need in your development

467.759

environment uh your IDs are gonna

471.419

probably need to match this is a it's

474.539

probably the classic of a data database

477.479

replication is if my record one two

480.599

three four five in production

483

is copied into development then I

485.639

probably need it to be id12345 in

488.099

development to make sure that all of the

490.319

other things that use that ID are

493.139

properly you know Linked UP

496.56

and so lookup values and uh duplicates

499.919

are things you want to worry about

501.18

because you don't particularly like if

503.22

I've you know have my five different

505.68

uh mailing types and I synchronize down

509.52

to my development I want to be able to

511.56

say okay those four rows that I have in

513.899

development

515.58

um the ones that it wants to bring in

516.599

from production they already exist

518.76

so you know you can do it the simple way

520.62

and say the first thing you do is you

522

just delete all each truncate delete all

524.82

the data from all your tables and

526.62

development and you don't have to worry

528

about duplicates you're just going to

529.38

just you know pull data in from

531.24

production

532.32

but maybe you don't want to do that you

534

know maybe it's going to say or maybe

535.2

there's something that blocks that as

536.94

far as

538.5

539.399

referential integrity and things like

541.26

that is concerned so you have to worry

543.779

about the idea of are the four rows that

546.6

I have in development that are in my

549.6

destination the same as the four rows

551.339

that I have in my Target and if not I

554.58

need to be able to handle that you know

555.899

be able to basically properly edit

558.839

change delete update records

561.899

then that can be tricky because

563.1

sometimes you have to maintain IDs and

566.22

do on it do it on a row specific ID by

569.88

ID basis as opposed to thinking as a

572.58

table of values

574.68

and it comes one of the big questions

576.06

all or some of the data

578.7

and it goes back to if you've got a 10

580.68

terabyte production database

583.26

you're not going to want to bring 10

584.94

terabytes of data down to your

586.32

development machine here especially if

587.58

you've got a little development laptop

588.72

you're gonna so you know maybe you only

591.06

want to do some of the data

593.399

in which case

595.98

it's not as simple as saying okay just

598.019

give me 10 rows out of each table

599.339

because

600.48

the 10 rows you have have to be able to

603.42

link to data in the other tables if

606.779

they're you know you have to maintain

607.86

referential Integrity with those rows

610.2

that you bring across

612.959

data can be really challenging because

615.24

it's do you care if there's been edits

618

to data or not do you need to do a

619.86

detailed compare of your data rows that

623.519

says okay for each column is the value

625.8

identical if not update it

629.459

and then even then you can get into

630.899

stuff like some time stamps and things

632.76

like that it's like do I really want to

634.2

do it do I not and so those are all

636.36

considerations to Think Through

638.76

and of course it gets back to uh space

641.58

and size you know do I

645

if I do it in a way that is complete is

648.6

it still going to be essentially too big

651.12

for what I need and so there can be some

653.94

you know there's those kinds of

655.26

considerations you have to walk through

656.82

now all of these things add up to

661.44

um a very complicated

664.68

application

666.18

uh and the way that I went through it in

669.12

this is basically walking through each

671.16

of these items if you go all the way

672.779

back to the front each of these items in

675.06

sort of uh trying to take them by

677.7

themselves

679.019

and that's what we end up with is that

680.88

we've got

682.8

um we've got these little basically

684.899

these little functions

686.839

that deal with tables that deal with the

690.779

differences in tables that deal with the

692.339

differences The Columns within a table

693.959

that deals with the store procedures and

696.6

where their differences in store

697.68

procedures and with functions because

699.839

store Precision functions are a little

702.42

different and then indexes and foreign

705.18

key relationships and we actually go

706.8

into triggers as well which are I guess

709.079

that sort of fall into the same as store

710.88

procedures and functions and in most

713.22

those cases what we end up with is that

715.68

we are

717.54

pull this up

723.18

I don't have that one open right now

726.06

so if we go into

730.56

we end up with this thing that is

734.899

uh let's see yeah

737.22

we're basically we go in and we're we're

739.5

figuring out each of these little

740.94

problems by itself so we need to figure

742.92

out are there missing tables or the

744.959

tables are missing in in the destination

747

from The Source uh we need to compare it

749.7

where we just walk through and say okay

751.92

you know are the tables the same and if

754.86

not we have to figure out what we do are

756.42

there procedures missing are there

758.76

functions missing are there triggers

762.42

that are missing and we're just

763.98

generating in a lot of these cases we're

765.48

going through and we're

767.76

getting all the information from our

769.62

source so for like well it triggers

772.5

Paradise good but let's say for uh

775.079

functions get really nasty actually set

777.42

a parameter but

778.92

store procedure functions it's basically

780.839

we go through and we say hey

782.459

I want to figure out in my source and

785.16

then in my destination what are all of

787.32

the store procedures for example that I

789.42

have and then we're gonna have to do is

791.12

wherever it doesn't exist I have to

793.32

actually go dig up

795.36

the definition which luckily you know

798.18

databases store that is the definition

799.86

of it in the source and then from that

804.18

definition and its parameters

808.079

I need to be able to then I'm going to

810.12

go through and basically I'm generating

811.86

SQL on the Fly and then executing that

825.74

thank you