Detailed Notes
This is the wrap-up and Q&A of a short series on data scraping. We cover the topic at a high level and describe the different forms a solution can take. Those include calling an API, downloading reports, and even scraping data a field at a time while crawling a web page. Tips and tricks are included to help you create a high-quality scraping solution and properly set expectations.
This comes from the Develpreneur.com mentor series of presentations.
Transcript Text
One thing to consider, particularly now that there is so much data to deal with, is caching some of the data, even locally on the machine doing the scraping. If you are doing lookups on a regular basis, you may want to cache those results to minimize the calls. Maybe you regularly refresh that lookup, but you are still storing it locally: in memory, in a local data store, or out to a file. That eliminates the need to hit that website, API, or RSS feed yet again.

You may also run into rate limits. With API calls in particular, it is common enough that there is a limit to the number of calls you can make, whether per month, per day, or per hour. Watch for this especially in your initial data-loading process; you may have to run for days to pull all the data you need for your initial setup. So you may need some sort of call counting or time delays. For example, if the provider says you can only make a thousand calls per hour, you will need something with a little timer and a count of calls. Maybe you make those calls quickly, hit your limit, and give it an hour before kicking things off again. Or you know you hit the limit over a 30-minute period, so 35 minutes from now you can start pulling again. Or you build in small sleep timers that guarantee you never make more than, say, 999 calls in an hour.
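That call-counting-plus-timer idea can be sketched in a few lines. This is a minimal illustration, not something from the talk itself; the injectable `clock` and `sleep` parameters exist only so the logic can be exercised without real waiting, and a production version would just use the `time` defaults:

```python
import time
from collections import deque

class RateLimiter:
    """Blocks so that at most `max_calls` calls happen per `window` seconds."""

    def __init__(self, max_calls, window, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # timestamps of recent calls

    def wait(self):
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # At the limit: sleep until the oldest call ages out of the window.
            self.sleep(self.calls[0] + self.window - now)
            now = self.calls[0] + self.window
            self.calls.popleft()
        self.calls.append(now)

# Usage: limiter = RateLimiter(max_calls=999, window=3600)
# then call limiter.wait() before every API request.
```

Each scraper request is then just `limiter.wait()` followed by the actual call; the limiter does the timer-and-count bookkeeping the talk describes by hand.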
So you may have to use delays or other little tricks to avoid hitting limits. Sometimes it is just a matter of making the call and then having the system sleep for five minutes before the next one, both so you don't crush the remote system and so you don't get flagged as an unusually high-traffic client.

Another thing you can do is look at timestamps and update dates, storing when you made your last call. The next time, provide a parameter (assuming the service supports that functionality) that says you only want data that has changed since that last call. The first time you run it, that may be an expensive call, but from then on you only pull the deltas, which is usually far more reasonable to work with.
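That "only the deltas" pattern might look like the sketch below. The `modified_since` keyword and the `last_run.json` state file are hypothetical stand-ins; real APIs expose the same idea under names like `updated_after` or an `If-Modified-Since` header:

```python
import json
import os
from datetime import datetime, timezone

STATE_FILE = "last_run.json"  # hypothetical local record of the previous pull

def load_last_run():
    """Return the ISO timestamp of the previous pull, or None on a first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_run"]
    return None

def pull_deltas(fetch):
    """First run pulls everything; later runs ask only for recent changes.

    `fetch` stands in for the real API call and must accept a
    `modified_since` keyword (None means 'give me everything').
    """
    since = load_last_run()
    records = fetch(modified_since=since)
    # Remember this run so the next one only asks for changes.
    with open(STATE_FILE, "w") as f:
        json.dump({"last_run": datetime.now(timezone.utc).isoformat()}, f)
    return records
```

The expensive full pull happens once; every run after that sends the stored timestamp and gets back only the changes.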
So what have we hopefully learned? First, scraping can come in many forms. When you receive a request to scrape data, realize it may not require actual web scraping; you may be able to do it through an API or RSS feeds instead. Another approach I didn't specifically get into, because it's really a web-scrape-ish kind of thing, is automation that logs into a site, follows a report link, generates the report, and saves the file. Then you have that file, whether XML, JSON, or CSV, and you parse, import, and ingest it rather than going to the website and scraping. Sometimes that is a far more effective, easier, and less fragile approach to take.

Also, and hopefully this is clear because I've brought it up so many times, the bulk of the work outside the parsing is really just mapping data. Getting the parsing done the first time, particularly handling the outliers, can be a real pain, but after that it tends to be, "Okay, I can get the data; where do I need to put it?" So it becomes mapping and making sure the relationships are properly maintained.

Formatting your data is also a critical piece; otherwise it will be really hard to generalize. For example, if you want to compare things that happen within a certain date range, you have to have date formats that match so you can do an apples-to-apples comparison of one date to another. Because of that, think of the two key ends: the parsers that pull data in and the formatters that convert that data to your format. Those are invaluable tools to have, and they are critical to building something that gets the right data and spits it out in a usable way.

And as I mentioned, whenever you do anything like this, start simple. Go with a single record, or some of the simplest data, make sure you can handle it, and validate your process. Then repeat it for more records or for the more complicated record formats until you've covered everything. That way you at least have a baseline that works.
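The apples-to-apples date point is easy to sketch: pick one canonical format and run every source's dates through a normalizer. The format list here is illustrative; in practice you grow it as each new source reveals its quirks:

```python
from datetime import datetime

# Formats seen across hypothetical sources; extend as new feeds are added.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]

def normalize_date(raw):
    """Convert any recognized source format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Once every source's dates pass through a formatter like this, a date-range comparison becomes a plain comparison of normalized values.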
Questions and comments?

It's kind of funny, Rob, because I have to deal with screen scrapers a lot. What exactly are you using Selenium for? Let's start with that one.

So for Selenium, I have used it a lot for scraping. I'll fire up Selenium, record my activity, and then work with what it gives me back. (Let me see if I even have it installed on this machine... shoot, I don't think I do.)

So are you using it for the screen-recording capabilities, to get the IDs you need to get where you're going, and then you translate that into Python?

A lot of times, yeah. I use Python a lot, but Selenium will create it for you: it will generate the script in Python, Java, PHP, a bunch of different languages. Basically, you record your activity and tell it to generate the code, and it produces the PHP or Python script that does exactly what you just did. That at least gives me a starting point, because sometimes the generated locators aren't general enough. It will use, say, XPath, which is usually pretty good, but if it uses a CSS selector, that may not work for all of the records I want. And that's typically because it records a single case. For example, with that job-site idea, I would record myself going to monster.com, running a search, and clicking on a job, and then I would want to parse that job data and pull it into my system. Then I take the generated script and generalize it. It sort of kick-starts the process: I've got the basics built in, and then I can go in and tweak the IDs and some of the navigational pieces. It definitely speeds up the whole scraping process.
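That "recorded for a single case, then generalized" step can be illustrated without Selenium at all. A recorded script typically targets one specific element (say, one job's id); widening the match to a shared class picks up every record on the page. The page markup and class names below are hypothetical, and the parser uses only the standard library:

```python
from html.parser import HTMLParser

class JobTitleScraper(HTMLParser):
    """Collects the text of every element whose class includes 'job-title'.

    A recorded script would target one id (e.g. 'job-123'); matching on the
    shared class generalizes the selector to all records on the page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self.in_title = "job-title" in classes

    def handle_endtag(self, tag):
        self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())

# Hypothetical page: two job records with distinct ids but a shared class.
page = """
<div id="job-123"><h2 class="job-title">Python Developer</h2></div>
<div id="job-456"><h2 class="job-title">Data Engineer</h2></div>
"""
scraper = JobTitleScraper()
scraper.feed(page)
```

Here `scraper.titles` ends up holding both job titles, which is exactly the jump from "works for the one record I clicked on" to "works for every record in the search results."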
So you're using it for the record-playback feature and the generation of the code? All right, yeah, that makes sense. I was just trying to figure out why you were using that rather than just writing the Python yourself. So you basically build the script and then modify the script, which is what we do for testing.

Yeah, I basically get Selenium to write the Python code for me, for the most part, and then I go back and make adjustments as needed.

Right, so here's kind of an odd observation, a nugget to add to what you're talking about. Could you do some of this with streams and batching processes, where you write something that essentially triggers a script? For your RSS feed, for example, you could pull the feed in through your stream, do that kind of scraping and formatting, and then just inject it into your system without having to go the other path.

What's the other path?

Where you have to actually write something yourself to go hit the RSS feed. If you use the stream process, you just point it at the feed and say, okay, consume this every day, or every hour.

Yeah, you can. It's really a different problem you're solving in that case, because a stream can consume a single source. Typically with scraping, you're not just displaying data on the fly; you're storing that information and doing something with it. It's not just that you're redisplaying or reformatting it. You're usually taking data from multiple sources and converting it into some general format that you then kick back out. For the RSS example, it would not be a single feed; you're going out to multiple feeds, and maybe you've got additional data around them, like "I've shown this item to that person before."

That's a different problem set, a different context. "I've got a stream of data that I want to display in a more user-friendly fashion," or "I'm going to do some basic processing on it, maybe some lookups to give it more context, but basically display that stream in a different context," is different from scraping, where I pull the data in and now it's my data. With streaming, it's really still not my data; I'm just tweaking the data as it gets displayed. With scraping, it's "this is my data," or "I'm pulling in all these sources and making them my data," and I'm going to do some sort of value-add before I kick it back out to my users.

Okay, that kind of makes sense. But you could do a consumer that essentially does that with the stream as well. That's what we're doing where I work: we're taking in a lot of feeds, converting them, and putting them into our system, into our data. That's why I was curious.

Yeah, in your case, in a general sense, that would be a scraping thing, but you're also storing it. You're not just redirecting it back out; you're taking all of those disparate sources, converting them into your data warehouse or whatever it is, and then combining those data pieces and sending them back out to your users or your systems in a consistent format.

Okay, that makes sense. It was just an interesting correlation I had never thought of until you presented this right after I talked about data streams.

Yeah, and I didn't go into streams specifically because they're slightly different; there's a little more complexity in dealing with a stream than with an RSS feed. But at the end of the day, there are definitely a lot of similarities between them.

Let's see, other questions or comments? Yeah, a question: what is the use of scripting here? Does it make the website faster, or are you just identifying abnormalities on the website?

Scripting on the scraping side of it? It's really there to automate, so that you're not having to go out and click and do everything by hand. At the end of the day it's a program; it allows you to programmatically go out and surf a website, grab pieces of data, and do something with them.

Okay, so what do you use that data for as a developer? Scraping is not scripting, is it?

You can use the data for just about anything. In healthcare, for example, there are companies that will go out and scrape data in some way, form, or fashion. A lot of times they've got feeds or streams they're looking at, but then they take that data, analyze it, and kick it back out: "Here are some current risks showing up in healthcare work," or "Here's how we're seeing insurance claims being processed," things like that. There are a lot of different ways you may utilize the data. It's really you saying, "I want to provide this service or solution for my users, but it's more valuable if I have data beyond what I currently have, so I'm going to go out and scrape all of these other sources to pull that data in and provide a bigger bang for the buck to my customers."

Okay, got it. So it's kind of like collecting user data, isn't it?

It's collecting data of whatever kind it may be. Take the job-site example: maybe you want to create a job site that people visit to see what jobs are available. Instead of hand-entering thousands of jobs and figuring out where those jobs are, you build a bunch of scrapers that go out to the various sites where you know jobs are published and pull that information into your site.

Got it, thank you.

Other questions or comments? All right, that brings us to the all-important thank you. We appreciate you listening through this and thinking about some of these things. It's a little different from how some people may have thought about scraping, so hopefully it has been useful to you. As always, if questions or comments come up after the presentation, ask; sometimes those are the interesting ones. Email info@develpreneur.com, or use the contact-us form out at develpreneur.com. You can follow us on Twitter at @develpreneur and DM us there, and we've also got a Develpreneur page out on Facebook. We appreciate your time as we work together to try to make every developer better. Have a good day.