Detailed Notes
After our series of building a python application to synchronize databases at the schema and data level we have lessons learned. There are areas where we need to be intentional as we decide how to maintain and enhance our databases. This is part 2.
Learn More at https://develpreneur.com
Transcript Text
thank you [Music] so we get to indexes and keys which are sort of a they they feel a little bit like columns because typically what you have is you've got like sort of like with store procedures you have a sort of a name of a index or key which is a lot of cases secondary because really it's the key the important things is what owns it you know like let's say it's uh what table is this index on and then what column or columns yes basically that signature is really what matters the name is really not significant for the most part other than if you're trying to compare because sometimes you you know you'll see the way my sequel works it you'll see that a column has an index on it but you have to go digging around otherwares otherwise other places to figure out what that index is so you need and base it ends up being sort of a key you have to go find the name and you use that name and then you can figure out what's the table that it's on what's the column or columns that it that it's um this or part of that index and then things like is this a is it a primary key is it a unique index um I think there's cluster there's a couple others that I didn't even stumble into yet as I was going through this but there are different flavors of indexes and then with those which is sort of a precursor talking about data is that there can be uh data Integrity issues and particularly if you're changing indexes and foreign keys you can run into the situation where it's awesome and your source and your production data that you're now replicating down but because of the data that exists in your target if you change that index or that key or add that foreign key it doesn't allow you to because the data is has not does not conform to that uh to that key or that index so for example if you try to create a unique index on a column that does not have unique values it will fail it will say you can't do that because you don't have unique values currently and if I try to do this it will break so you sort of get it's almost a chicken and egg kind of situation because what you need to do is get that get the data essentially corrected changed altered first and then deal with your indexes and keys but you sort of need the indexes and keys there sometimes to help you figure out what data you're going to move from your source to your destination and this is where we get to data issues are really through the whole thing are the toughest things to Think Through is while the structure is key and it's actually it would not be it's not terribly difficult to say if I don't touch the data and all I want to do is either synchronize you know it's basically synchronize everything that I can and then have a list of changes that need to be done where I can't which would be for example uh I want to create I have a unique index in production I want to create that unique index in my target but the data doesn't allow it have some sort of like an exception log or something that just says hey you want to create this or this is a place this is something that is not synchronized I can't synchronize it you know automated because the data has to change in your development environment but then you would know just say oh I'm going to go in I'm going to tweak my data or delete the rows that are duplicates or whatever it is however I addressed that data issue and then I can run it through and synchronize but the key again the challenge always comes down to the data because you have to worry about things like um and this is just you know it's just the start of it but if you let's assume that you are okay moving all of your data or a portion of your data from production to your development machine which is much more common you know it's really I think a lot of us if we had a tool that said I want to be able to take a hundred random rows out of production pull those into my development environment and have it bring those rows plus all of the related data so that I have effectively production but a 100 row version of production then it's like perfect for development testing and things like that because now I have I have actual production data sitting in my development machine but I only have a few rows of it so it's not near the size and I can do a lot of testing with it now you know performance testing is not going to be the same obviously but like data and structure and design and coding related stuff I can get all that so one of the things to think about or that's like the common data issues it's just deleted records and in particular um it's not really General data and I'm going to sort of classify stuff a little bit here General data is not as much an issue so if like if you have a customer that exists in production and they don't exist in your development that's okay or if they I'm sorry if it exists in development and it doesn't exist in production that's okay however if you have a lookup of some sort so if you have um let's say you have like four different address types in production and or let's say you had five so you had I don't know shipping home billing uh mailing and something office and you say hey we're not going to use Office anymore so now it's just going to be billing shipping home and whatever the other one was I used so you go from five to four well those lookup values now in your development is going to be it can throw stuff off you're giving a c stuff differently if you think of like a web app like a crud application that has drop downs that utilize those lookups then suddenly you know you're going to see options in development that don't exist in production and that sometimes can be an issue because it may be that's you know that's where the bug resides that you're trying to track down or there's a feature that you're working with that changes because you have a different number of uh Records that show versus what would always show in production um sort of late it is probably one of the hardest things is whether you care about um and that's particularly if there's what's you know like audit tables and things like that whether you want to bring down audit tables because those can get huge and typically are not terribly useful uh yeah the audit trails that you have in production you may or may not need in your development environment uh your IDs are gonna probably need to match this is a it's probably the classic of a data database replication is if my record one two three four five in production is copied into development then I probably need it to be id12345 in development to make sure that all of the other things that use that ID are properly you know Linked UP and so lookup values and uh duplicates are things you want to worry about because you don't particularly like if I've you know have my five different uh mailing types and I synchronize down to my development I want to be able to say okay those four rows that I have in development um the ones that it wants to bring in from production they already exist so you know you can do it the simple way and say the first thing you do is you just delete all each truncate delete all the data from all your tables and development and you don't have to worry about duplicates you're just going to just you know pull data in from production but maybe you don't want to do that you know maybe it's going to say or maybe there's something that blocks that as far as um referential integrity and things like that is concerned so you have to worry about the idea of are the four rows that I have in development that are in my destination the same as the four rows that I have in my Target and if not I need to be able to handle that you know be able to basically properly edit change delete update records then that can be tricky because sometimes you have to maintain IDs and do on it do it on a row specific ID by ID basis as opposed to thinking as a table of values and it comes one of the big questions all or some of the data and it goes back to if you've got a 10 terabyte production database you're not going to want to bring 10 terabytes of data down to your development machine here especially if you've got a little development laptop you're gonna so you know maybe you only want to do some of the data in which case it's not as simple as saying okay just give me 10 rows out of each table because the 10 rows you have have to be able to link to data in the other tables if they're you know you have to maintain referential Integrity with those rows that you bring across data can be really challenging because it's do you care if there's been edits to data or not do you need to do a detailed compare of your data rows that says okay for each column is the value identical if not update it and then even then you can get into stuff like some time stamps and things like that it's like do I really want to do it do I not and so those are all considerations to Think Through and of course it gets back to uh space and size you know do I if I do it in a way that is complete is it still going to be essentially too big for what I need and so there can be some you know there's those kinds of considerations you have to walk through now all of these things add up to um a very complicated application uh and the way that I went through it in this is basically walking through each of these items if you go all the way back to the front each of these items in sort of uh trying to take them by themselves and that's what we end up with is that we've got um we've got these little basically these little functions that deal with tables that deal with the differences in tables that deal with the differences The Columns within a table that deals with the store procedures and where their differences in store procedures and with functions because store Precision functions are a little different and then indexes and foreign key relationships and we actually go into triggers as well which are I guess that sort of fall into the same as store procedures and functions and in most those cases what we end up with is that we are pull this up I don't have that one open right now so if we go into we end up with this thing that is uh let's see yeah we're basically we go in and we're we're figuring out each of these little problems by itself so we need to figure out are there missing tables or the tables are missing in in the destination from The Source uh we need to compare it where we just walk through and say okay you know are the tables the same and if not we have to figure out what we do are there procedures missing are there functions missing are there triggers that are missing and we're just generating in a lot of these cases we're going through and we're getting all the information from our source so for like well it triggers Paradise good but let's say for uh functions get really nasty actually set a parameter but store procedure functions it's basically we go through and we say hey I want to figure out in my source and then in my destination what are all of the store procedures for example that I have and then we're gonna have to do is wherever it doesn't exist I have to actually go dig up the definition which luckily you know databases store that is the definition of it in the source and then from that definition and its parameters I need to be able to then I'm going to go through and basically I'm generating SQL on the Fly and then executing that thank you
Transcript Segments
thank you
[Music]
so we get to indexes and keys which are
sort of a they they feel a little bit
like columns because typically what you
have is you've got like sort of like
with store procedures you have a sort of
a name of a index or key which is a lot
of cases secondary because really it's
the key the important things is what
owns it you know like let's say it's uh
what table is this index on and then
what column or columns yes basically
that signature is really what matters
the name is really not significant for
the most part other than if you're
trying to compare because sometimes you
you know you'll see the way my sequel
works it
you'll see that a column has an index on
it but you have to go digging around
otherwares otherwise other places to
figure out what that index is so you
need and base it ends up being sort of a
key you have to go find the name and you
use that name and then you can figure
out what's the table that it's on what's
the column or columns that it that it's
um this or part of that index and then
things like is this a is it a primary
key is it a unique index
um I think there's cluster there's a
couple others that I didn't even stumble
into yet as I was going through this but
there are different flavors of indexes
and then with those which is sort of a
precursor talking about data is that
there can be uh data Integrity issues
and particularly if you're changing
indexes and foreign keys
you can run into the situation where
it's awesome and your source and your
production data that you're now
replicating down but because of the data
that exists in your target
if you change that index or that key or
add that foreign key it doesn't allow
you to because the data
is has not does not conform to that uh
to that key or that index so for example
if you try to create a unique index on a
column that does not have unique values
it will fail it will say you can't do
that because you don't have unique
values currently and if I try to do this
it will break
so you sort of get it's almost a chicken
and egg kind of situation because what
you need to do is get that
get the data essentially corrected
changed altered first
and then deal with your indexes and keys
but
you sort of need the indexes and keys
there sometimes to help you figure out
what data you're going to move from your
source to your destination
and this is where we get to data issues
are really
through the whole thing are the toughest
things to Think Through
is while the structure is key and it's
actually it would not be it's not
terribly difficult to say if I don't
touch the data and all I want to do
is either synchronize you know it's
basically synchronize everything that I
can and then have a list of changes that
need to be done where I can't which
would be for example
uh I want to create I have a unique
index in production I want to create
that unique index in my target but the
data doesn't allow it have some sort of
like an exception log or something that
just says hey you want to create this or
this is a place this is something that
is not synchronized I can't synchronize
it you know automated because the data
has to change in your development
environment but then you would know just
say oh I'm going to go in I'm going to
tweak my data or delete the rows that
are duplicates or whatever it is
however I addressed that data issue and
then I can run it through and
synchronize
but the key again the challenge always
comes down to the data because you have
to worry about things like
um and this is just you know it's just
the start of it but if you let's assume
that you are okay moving all of your
data or a portion of your data from
production to your development machine
which is much more common you know it's
really
I think a lot of us if we had a tool
that said
I want to be able to take
a hundred random rows out of production
pull those into my development
environment and have it bring those rows
plus all of the related data
so that I have effectively production
but a 100 row version of production then
it's like perfect for development
testing and things like that because now
I have I have actual production data
sitting in my development machine but I
only have a few rows of it
so it's not near the size
and I can do a lot of testing with it
now you know performance testing is not
going to be the same obviously but like
data and structure and design and coding
related stuff I can get all that
so one of the things to think about or
that's like the common data issues it's
just deleted records and in particular
um
it's not really General data and I'm
going to sort of classify stuff a little
bit here General data is not as much an
issue so if like if you have a customer
that exists in production and they don't
exist in your development
that's okay or if they I'm sorry if it
exists in development and it doesn't
exist in production that's okay however
if you have a lookup of some sort so if
you have
um let's say you have like four
different address types in production
and or let's say you had five so you had
I don't know shipping home billing uh
mailing and something office and you say
hey we're not going to use Office
anymore so now it's just going to be
billing shipping home
and whatever the other one was I used so
you go from five to four well those
lookup values now in your development is
going to be it can throw stuff off
you're giving a c stuff differently if
you think of like a web app like a crud
application that has drop downs that
utilize those lookups then suddenly you
know you're going to see options in
development that don't exist in
production and that sometimes can be an
issue because it may be that's you know
that's where the bug resides that you're
trying to track down or there's a
feature that you're working with that
changes because you have a different
number of uh Records that show versus
what would always show in production
um sort of late it is
probably one of the hardest things is
whether you care about
um and that's particularly if there's
what's you know like audit tables and
things like that whether you want to
bring down audit tables because those
can get huge and typically are not
terribly useful uh yeah the audit trails
that you have in production you may or
may not need in your development
environment uh your IDs are gonna
probably need to match this is a it's
probably the classic of a data database
replication is if my record one two
three four five in production
is copied into development then I
probably need it to be id12345 in
development to make sure that all of the
other things that use that ID are
properly you know Linked UP
and so lookup values and uh duplicates
are things you want to worry about
because you don't particularly like if
I've you know have my five different
uh mailing types and I synchronize down
to my development I want to be able to
say okay those four rows that I have in
development
um the ones that it wants to bring in
from production they already exist
so you know you can do it the simple way
and say the first thing you do is you
just delete all each truncate delete all
the data from all your tables and
development and you don't have to worry
about duplicates you're just going to
just you know pull data in from
production
but maybe you don't want to do that you
know maybe it's going to say or maybe
there's something that blocks that as
far as
um
referential integrity and things like
that is concerned so you have to worry
about the idea of are the four rows that
I have in development that are in my
destination the same as the four rows
that I have in my Target and if not I
need to be able to handle that you know
be able to basically properly edit
change delete update records
then that can be tricky because
sometimes you have to maintain IDs and
do on it do it on a row specific ID by
ID basis as opposed to thinking as a
table of values
and it comes one of the big questions
all or some of the data
and it goes back to if you've got a 10
terabyte production database
you're not going to want to bring 10
terabytes of data down to your
development machine here especially if
you've got a little development laptop
you're gonna so you know maybe you only
want to do some of the data
in which case
it's not as simple as saying okay just
give me 10 rows out of each table
because
the 10 rows you have have to be able to
link to data in the other tables if
they're you know you have to maintain
referential Integrity with those rows
that you bring across
data can be really challenging because
it's do you care if there's been edits
to data or not do you need to do a
detailed compare of your data rows that
says okay for each column is the value
identical if not update it
and then even then you can get into
stuff like some time stamps and things
like that it's like do I really want to
do it do I not and so those are all
considerations to Think Through
and of course it gets back to uh space
and size you know do I
if I do it in a way that is complete is
it still going to be essentially too big
for what I need and so there can be some
you know there's those kinds of
considerations you have to walk through
now all of these things add up to
um a very complicated
application
uh and the way that I went through it in
this is basically walking through each
of these items if you go all the way
back to the front each of these items in
sort of uh trying to take them by
themselves
and that's what we end up with is that
we've got
um we've got these little basically
these little functions
that deal with tables that deal with the
differences in tables that deal with the
differences The Columns within a table
that deals with the store procedures and
where their differences in store
procedures and with functions because
store Precision functions are a little
different and then indexes and foreign
key relationships and we actually go
into triggers as well which are I guess
that sort of fall into the same as store
procedures and functions and in most
those cases what we end up with is that
we are
pull this up
I don't have that one open right now
so if we go into
we end up with this thing that is
uh let's see yeah
we're basically we go in and we're we're
figuring out each of these little
problems by itself so we need to figure
out are there missing tables or the
tables are missing in in the destination
from The Source uh we need to compare it
where we just walk through and say okay
you know are the tables the same and if
not we have to figure out what we do are
there procedures missing are there
functions missing are there triggers
that are missing and we're just
generating in a lot of these cases we're
going through and we're
getting all the information from our
source so for like well it triggers
Paradise good but let's say for uh
functions get really nasty actually set
a parameter but
store procedure functions it's basically
we go through and we say hey
I want to figure out in my source and
then in my destination what are all of
the store procedures for example that I
have and then we're gonna have to do is
wherever it doesn't exist I have to
actually go dig up
the definition which luckily you know
databases store that is the definition
of it in the source and then from that
definition and its parameters
I need to be able to then I'm going to
go through and basically I'm generating
SQL on the Fly and then executing that
thank you