Detailed Notes
This is part two of a short series on data scraping. We cover the topic at a high level and describe the forms these solutions can take, including calling an API, downloading reports, and even scraping data a field at a time while crawling a web page. We also include tips and tricks to help you create a high-quality scraping solution and properly set expectations.
This comes from the Develpreneur.com mentor series of presentations.
Transcript Text
So you're probably going to have to do some additional POSTs: find links, click on them, essentially post requests that get the data back. You may even need to trigger searches or filters, or build in delays. Let's use Best Buy as an example, since we've seen that one before. Say I want to look for PS5 consoles. What I need the application to do is trigger the search, walk through all of the result items, scrape each one, grab the data, and push it out into my database or data store.

That's scraping. As I said, it can be time-consuming, and it can be very fragile. If the site changes its layout or format, that is going to break the scraping piece, because you are assuming that things follow a certain structure, and it's not a published structure. It's sort of like an undocumented feature: you can go in and scrape a site, but as soon as they change something, you're going to have to update your scrapers.

Another approach that is not used as often, but falls more into the API category, is using RSS feeds. Here you're going to get really good structured data, much like an API. RSS feeds tend to give you fewer parameters, but at least they will give you some sort of feed of data: for a time frame, or maybe out to a limited number of results. They're really easy to call. For example, I can go out to the Develpreneur site and request its RSS feed. I don't want to do it here because it won't display right, so let me jump over to a browser.
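That walk-the-results flow can be sketched with Python's standard library alone. The `item-title` and `item-price` class names below are hypothetical; a real retail site uses its own markup, and that markup changes without notice, which is exactly why this kind of scraper is fragile.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Product:
    title: str
    price: str


class ResultParser(HTMLParser):
    """Collect text from elements matching an assumed result-item markup."""

    def __init__(self):
        super().__init__()
        self._field = None       # which field the parser is currently inside
        self._current = {}
        self.products = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "item-title" in classes:
            self._field = "title"
        elif "item-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "title" in self._current and "price" in self._current:
                self.products.append(Product(**self._current))
                self._current = {}


def scrape_results(html: str) -> list[Product]:
    """Walk one page of search results and return the scraped items."""
    parser = ResultParser()
    parser.feed(html)
    return parser.products
```

In a real run you would fetch the search-results page first (and honor delays between requests); the parsing step is the part that breaks when the site's structure changes.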
Over in Chrome, if I go to the Develpreneur feed (assuming I can spell it right), I get back something like this. It's XML, which makes it nice, and you can see it follows a format. For this RSS feed, I can see what the channel is and whether there's an image involved, which I can always store. Within each item, I can see the title, a link to it, who the creator is, the publishing date, category fields, the description, and comments. All of it is XML formatted. Notice that I passed no parameters, so this just brings back data for however far back the feed goes; in this case, since it's WordPress, the site controls how many articles back you can get. But the results are nicely formatted, and we have a lot of ways to parse our way through XML, pull that information out, and bring it into our database. So it's very similar to an API. APIs may return XML as well, although sometimes it may be easier to use JSON, since a lot of times you can get that format too. In general, though, RSS feeds are far easier to deal with than actual web page scraping, and therefore they can be a very powerful tool for your scraping efforts.

So let's talk about some tips, tricks, and trip-ups you may run into: some helpful hints if you're going to get into a scraping project. The first thing to do is start simple and validate your calls, whether that is an API call, scraping a site, or an RSS feed. With an API call, put as tight a parameter on the query as possible. If you can, narrow it down to a call by a single ID, so that you know you get a single record.
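Pulling those item fields out of a feed takes only a few lines with the standard library's ElementTree. This is a minimal sketch for RSS 2.0; the `dc:creator` element lives in the Dublin Core namespace that WordPress feeds use.

```python
import xml.etree.ElementTree as ET

# Namespace WordPress uses for the per-item author field.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}


def parse_feed(xml_text: str) -> list[dict]:
    """Return one dict per <item> with the fields we care about."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "creator": item.findtext("dc:creator", namespaces=DC),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

Because `findtext` returns None for a missing element, absent fields show up as gaps you can handle later rather than as parse failures.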
That single record is what you want to start with. In web scraping, the simplest case would be a single page: I want to go to that page and pull its data, or maybe pull that data, navigate to the detail page, and pull the data from there. Going through that process of figuring out exactly what you want to get, verifying that you can get it, building out the mappings, and storing the data while making sure it maintains its integrity is always going to be the first step. It is so much easier to do that with a small result set than by trying to spin through huge numbers of records. For one, it takes a while to go through those records, so your cycle time is going to be pretty big. But also, it's easy to get lost dealing with a whole bunch of different things and special cases. It's better to pick one record, get that working, and then expand out to future records. If you then see special cases, or places where fields are missing, you can address those on a case-by-case basis to support a wider range of results.

You're also going to want to look for hidden fields. On a website, those are fields that are literally of a hidden type, or things that are not otherwise obvious. In an API, they may even be undocumented: maybe you see the result sets the documentation describes, but when you actually pull the API results, you find additional fields. A lot of times these are going to be things like internal IDs that you'll need in order to properly link to another web page, or to make a subsequent call that gets you more detail about the data you're pulling.
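One low-tech way to hunt for those hidden fields on a page is to collect every hidden input before deciding which internal IDs matter for a follow-up request. A sketch with the standard library's HTMLParser (the `itemId` name in the test markup is made up):

```python
from html.parser import HTMLParser


class HiddenFieldFinder(HTMLParser):
    """Collect <input type="hidden"> name/value pairs from a page.

    These are often the internal IDs a site round-trips through forms,
    and the ones you need to build a subsequent detail request."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value")


def find_hidden_fields(html: str) -> dict:
    finder = HiddenFieldFinder()
    finder.feed(html)
    return finder.fields
```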
Since we usually work with multiple data sources, it is very helpful to use a pattern where you have intermediary classes or structures to store things in, and then those classes do the save and update to your data store, whatever that happens to be. The first time through, you have to build the whole thing: the scraper or parser, then the data structure, then the insert, update, and delete functions (the CRUD-related pieces) so that the data in those structures can be pushed into your database. But the bonus is that the next time around, if you did it right, all you have to do is build the parser and mapper into your general structure, and everything behind it already works. You know your inserts, updates, and deletes all work as long as the data is valid. So the next time through, you just do your parsing, and since you've probably already built a lot of the validations into those classes, you parse or scrape, put the data into the class, run your validation steps, and voila: you know whether you have a valid set of data.

Alongside that, either in those classes or as utility functions, you're going to find that formatting helpers are very valuable. Do not assume, particularly across multiple data sources, that things will always have the same format. Phone numbers and dates in particular are probably the most common things you'll run into. On one site a phone number may be ten straight digits; on another it may always have hyphens; some may be formatted with parentheses and such.
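Here is a minimal sketch of that intermediary-class pattern, using an in-memory SQLite table as a stand-in data store. The `JobPosting` shape and the source-specific field names (`jobTitle`, `town`) are hypothetical; the point is that adding a new source means writing only a new mapper, because validation and persistence already live in the shared class.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class JobPosting:
    """Intermediary structure every source-specific parser maps into.

    Only this class knows how to validate and persist itself, so the
    CRUD code is built once and reused for every data source."""
    source: str
    title: str
    city: str

    def validate(self) -> bool:
        return bool(self.title and self.city)

    def save(self, conn: sqlite3.Connection) -> None:
        conn.execute(
            "INSERT INTO jobs (source, title, city) VALUES (?, ?, ?)",
            (self.source, self.title, self.city),
        )


def parse_source_a(record: dict) -> JobPosting:
    # One mapper per data source; this is the only new code a source needs.
    return JobPosting(source="A", title=record["jobTitle"], city=record["town"])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (source TEXT, title TEXT, city TEXT)")

posting = parse_source_a({"jobTitle": "RN - ICU", "town": "Houston"})
if posting.validate():
    posting.save(conn)
```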
You're going to want something that takes those individual data items and does some sort of conversion to a standard format that you're going to store. Like I said, dates are huge for that. A lot of times dates will be all over the map in terms of the style and format they use, and many of them will be localized. It may not be an actual date-time field where you can just do your conversion easily; particularly with web-scraped data, there are all kinds of ways you may see a date displayed. So you want a helper that takes a string in, does some formatting, and spits it out in a consistent format: the one you're going to use for your data storage.

Like I said, date and time formats vary quite a bit, and money does as well. Currency differs from country to country: you'll have different currencies, different symbols, and things of that nature. You may even have a different scale of numbers, where a million units of one currency is almost unheard of in one economy, while everyday transactions run to a million of that currency in another. That again is where formatting helpers are going to be huge, because there are so many different ways a value can appear. You want to be able to cut through the formatting that may be on a display, get to the actual data items you need, and convert those into your system. Now, dates and times in particular can be a challenge when you're pulling from different data sources. If dates and times need to match up with data from other sources, you may run into situations where you need to know what time zone the data is in.
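Formatting helpers along those lines might look like this. The accepted phone and date formats below are only the ones named above; a real helper grows one format at a time as new sources appear.

```python
import re
from datetime import datetime


def normalize_phone(raw: str) -> str:
    """Strip every display format down to ten digits; "" if unusable."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading US country code
    return digits if len(digits) == 10 else ""


def normalize_date(raw: str) -> str:
    """Try each format we have actually seen; output is always ISO 8601."""
    for fmt in ("%m/%d/%Y", "%d %B %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Every scraped value passes through one of these on its way into storage, so the database only ever sees the canonical form.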
For dates, times, money, and currencies, you're probably going to want to convert to a universal time zone or a universal monetary unit, and then be able to convert back out if you need to. For money, maybe you hit sites that use euros, rubles, and a whole bunch of other currencies, but you decide to convert everything to US dollars, store that in your database, and then have your features convert those monetary values to whatever currency the user wants when you display them. There is a lot of work that can go into those conversions, but once you have them they can be pretty valuable. The nice thing about a lot of these formatting helpers, and even some of those structures, is that they're portable from project to project. A lot of the validators in particular (phone numbers, addresses, things like ZIP codes that can vary widely) are going to show up app after app, so once you've built them, that's nice code to carry with you to your next project.

Next up: sizing and outliers. This is probably the biggest scraping challenge a lot of the time, along with figuring out how to take multiple data sources and combine them into a single data source, your store. You may have values that are typically a certain length. Think of a name: if you have a last name, maybe 20 characters is typically going to be the limit, or maybe 30.
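The convert-to-universal, convert-back-on-display idea can be sketched like this. Storing UTC uses the standard library's zoneinfo; the exchange rates here are made-up placeholders, since a real system would pull them from a rate source.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_string: str, source_tz: str) -> datetime:
    """Interpret a source's naive timestamp in its own zone, store UTC."""
    naive = datetime.strptime(local_string, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)


# Hypothetical rates; a real system would fetch current ones.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "RUB": 0.011}


def to_usd(amount: float, currency: str) -> float:
    """Normalize every scraped amount to the one currency we store."""
    return round(amount * RATES_TO_USD[currency], 2)
```

Displaying back out is the mirror image: divide by the target currency's rate, or convert the stored UTC timestamp into the viewer's zone.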
But then you may have somebody with a really convoluted hyphenated last name, something like Jones-Brown-Smith-McGillicuddy, just one big long thing. You need to figure out how to handle that. Do we need some incredibly long data size to be able to store anything we think might show up? Or do we need some way to encode it so it gets cut down to a smaller value, or maybe truncated, or something along those lines?

Then you have to look at outliers as well. You're going to find situations where 999 times out of 1,000 the data looks one way, but every so often it looks a little different. That may be due to special characters, which we'll look at a bit as well, or, particularly with web scraping, the data may almost always be on one line except for the times it wraps to a second line. Those kinds of outliers are things you're going to have to keep an eye out for when you're scraping.

You also have to look at what happens if there's a null or an empty value, because sometimes data just doesn't get entered, and it can be inconsistent. You may have an address that's typically address line, city, state, ZIP, and you'll find records with an address line and city but no state, or city and state but no ZIP, or a state but no city, or maybe no address line and just a ZIP. You've got to be aware that there are going to be gaps in the data, and decide how you want to handle them, because sometimes that data is critical for you to be able to utilize the record. For example, let's go back to that job data. Let's say part of what you do is give a list of
healthcare-related jobs by city. If you pull data that doesn't tell you what city a job is in, that record is almost useless to you, because you don't know which city to file it under. But in the same situation, what happens when a job is listed in four different cities? You probably want it replicated to four records, one for each city. But then what happens if somebody searches two or three of those cities? Do you want them to see that same job show up twice or more?
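That multi-city question can be handled by replicating on write and de-duplicating on read. A sketch (the record fields here are hypothetical):

```python
def explode_by_city(job: dict) -> list[dict]:
    """Replicate on write: one stored record per city the job lists."""
    return [{**job, "city": city} for city in job["cities"]]


def search(records: list[dict], cities: list[str]) -> list[dict]:
    """De-duplicate on read: a multi-city search shows each job once."""
    seen, results = set(), []
    for record in records:
        if record["city"] in cities and record["id"] not in seen:
            seen.add(record["id"])
            results.append(record)
    return results
```

Replicating keeps the by-city lookup simple; the job's own ID is what lets the search side collapse the copies back into one result.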