Detailed Notes
This is part one of a short series on data scraping. We cover the topic at a high level and describe the different forms a solution can take. Those include calling an API, downloading reports, and even scraping data a field at a time while crawling a web page. Tips and tricks are included to help you create a high-quality scraping solution and properly set expectations.
This comes from the Develpreneur.com mentor series of presentations.
Transcript Text
This presentation really came from a lot of recent requests and consulting project postings I've seen. Data is something people are now aware of, and they realize there are ways to get it for free on the Internet, or sometimes behind a membership or a paid login. Rather than rely on whatever interface a site or provider offers, sometimes it's easier to just scrape. That's where this presentation comes from: looking at the different options you have if you want to, or need to, scrape data, because what you have in mind for scraping may not be as simple as the solution you actually have available. We'll start with an overview, then talk about data sources like APIs, actual scraping off of web pages, and RSS feeds (which are very similar to APIs), and wrap up with the tips, tricks, and trip-ups you need to think about before jumping into a scraping project.

In general, people often think of scraping in the old-school sense: trying to get data off of mainframes and into some sort of modern database. What was called screen scraping would log in, go to the display, move X characters over and Y characters down, pull the value, and suck it into a database. Scraping is essentially the equivalent of a screenshot: you pull data off the screen, parse it, and push it somewhere. A lot of times it also includes interaction, so you're entering data and clicking buttons, as you often see with web crawling; spidering a website tends to go hand in hand with the idea of scraping.

The goal of a scraping project is to make use of data that is usually publicly available, or sometimes behind some sort of paywall. Either way, you're using somebody else's data and sourcing it in a way you can use. Think of yourself as a clearinghouse of data: you can format it, manipulate it, run calculations on it, group it, you name it. You're taking all this data and repurposing it. Sometimes you'll have multiple sources, so you may be validating data, or the data may be regional, so you get slightly different data from the various sources and have to find some way to merge them. A lot of times you'll have different formats. Your target is your database or data warehouse, but the format, and even the consistency, of the data you pull in can vary widely. In some cases you may get an entire address with city, state, and zip; in others, maybe just a name, a street, or only a zip code.

A good example of multiple formats is job sites, particularly if you surf three or four of them. Look at Monster.com versus prg.jobs.com, Indeed.com, or Ladders.com: all of these sites effectively have job data, but it will differ. Some will have different levels of description, different categories, all kinds of things like that. Some will tell you who the customer or company is; some won't. All of those are multiple formats.

Like I said, your goal is to repurpose this data for whatever your solution happens to be. Take the job example: maybe you want a site that lets people see all the medical and healthcare-related jobs that are out there. You could go to all of these different sites and, depending on how they're set up, scrape their data. Instead of people having to go to all those sites and run searches across all of them, you're basically doing the search for them, taking the results, and repurposing and reformatting them in a consistent way so users can more easily make apples-to-apples comparisons. A lot of this is the ability to share data that would otherwise be very difficult to share, or simply to provide easier access. Particularly when you're talking about a lot of different data sources, it can be a pain to go out to ten different places, and there is definite value in being able to come to one place and see all that data. RSS feed readers are exactly this: instead of you having to look at all these feeds and go to the sites to read their articles, the app hits all of the RSS feeds, pulls the articles in, and lets you read them in one easy place. That's a summary of what scraping looks like.

When you talk about the many data sources, probably the easiest way to pull data is via APIs. If you have data sources that provide an application programming interface (API), you can get a lot of data that is structured and clean, and you can usually do a lot of filtering, sending parameters into the query to limit things to what matters to you. An API may be public, so it may just be a matter of hitting an address and having it kick some information back, or you may need some sort of access key or authentication. That is definitely not uncommon these days; a lot of cloud-based software has APIs but requires at least some sort of access key or developer key, really so people don't crush their systems with requests. Once you've got that key, it comes down to the parameters and the data you want to use. Most of these are documented, sometimes very solidly. Pull up a typical API's documentation and you'll see what the call is and what the parameters are (maybe I don't want to bring everything in; maybe I only want data for the last month, or only data since the last time I made a call), along with nicely formatted results: column names, data types, things like that. That makes an API fairly clean and easy to use, because at that point it's just simple data mapping: you know what you're getting from the API and where you want to store it.

With an API, because of how they're set up, you'll also want to think about your frequency. How often do you need to make a call? Do you need to update essentially in real time, like every five minutes? Once an hour, once a day, once a week? That varies quite a bit. Think again of the job sites example: you may want to run every hour so there are fresh jobs showing up to bring people back. But maybe your customers are typically only looking for jobs on the weekend, so you only run it once or twice a week, maybe Thursday or Friday night, to go get all the latest postings; that covers you for the next week.

With most APIs, connecting is usually pretty easy, so the work really comes down to mapping the results: deciding which of the data the API provides you actually need, and then mapping that into places to store in your database. Sometimes there is what we'll call child data, or related data. You may make a call, pull all your data in, and find some sort of ID that you have to go back and use, making additional calls with that list of IDs to get more information. For example, say you're pulling in information on companies, and when you call for the companies each one has a primary contact that is basically just a name and an ID. The API also has a way to get contact information: an address, email, phone, and things like that. In that case you may have to do a multi-pass: first get all of the companies and store that information wherever you need to, then go back with all of those contact IDs, make a call or calls, and match the contact IDs (depending on how it works) to make sure you store your data and the relationships within it.

That means you may sometimes have to do lookups or some sort of ID caching. For example, say you're bringing companies into a site that displays company information. When that data updates, you want your subsequent call to update the correct data in your system rather than create yet another customer record, so you need to know what that ID is. Particularly if you're working with multiple systems and multiple data sources, you'll want some way to know what the data source was and what the ID was in that system, while separately having your own ID within your system. You have to be able to map those in some way, whether by pulling a batch of data and caching a list of IDs to do on-the-fly mappings, or by storing the data source and external ID on each record. The key is that APIs typically give us the cleanest way to get our data.

Scraping is going to be a little more challenging, because this is where we go out to a website and pull data. For example, maybe I want to go to the Develpreneur site, look at the blog, and have a scraper that goes through and pulls all the content: the titles, the date each post was published, maybe the category. There's a lot you can get. In this case you could actually use an RSS feed, but where no RSS feed exists, you end up having to scrape. That means looking at the page source and finding a way to navigate down into it: to get a title, I have to find the right link and get the text out of that anchor. It can be challenging, but you can get the data exactly as you need it, usually exactly how the database behind the site has it.

Usually scraping involves some sort of home page, or maybe even a login, and this is going to be automated, a lot of times through something like Selenium. Selenium gives you an easy way to record interactions and then generate code to do these things, but you can also use whatever your language of choice is, whether Java or Python or Ruby or whatever; as long as it can do a POST, you can post out to the site, bring the information back, and start walking through the resulting data. You're probably going to need some sort of keys or IDs or paths. In the browser's developer tools you can copy selector paths and XPaths for an element. An XPath is not the easiest thing to read, but it gets you to where you need to go based on something like an ID. That specific copied path is usually not that useful on its own, though, because IDs are going to change; you have to have a way to walk through the whole structure, whatever it is. Here, that's probably a post loop, so you're going to have to get everything that is a list post. Scraping can be time consuming, but it also doesn't require the customer to provide you much. You go in, you figure out what your IDs are, what your path is, and what data you're going to pull, and a lot of times you have to repeat that because there are lists of data on the site you're scraping. You'll probably also have to do some link navigating: in this case, if I wanted more information, I would have to click on the link and pull the full data from there. Thank you.
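To make the API approach concrete, here is a minimal sketch in Python of filtering with query parameters and only asking for data changed since the last call. The endpoint, key, and parameter names are all hypothetical stand-ins; a real API's documentation tells you the actual call shape.

```python
import urllib.parse
from datetime import datetime, timedelta, timezone

# Hypothetical endpoint and developer key; real names come from the API docs.
BASE_URL = "https://api.example.com/v1/jobs"
API_KEY = "your-developer-key"

def build_query(since, category=None):
    """Build a filtered query URL so we only pull what matters to us."""
    params = {"updated_since": since.isoformat(), "per_page": 100}
    if category:
        params["category"] = category
    return BASE_URL + "?" + urllib.parse.urlencode(params)

# Only ask for data changed since the last run instead of re-pulling everything.
last_run = datetime.now(timezone.utc) - timedelta(days=7)
url = build_query(last_run, category="healthcare")

# Against a real endpoint you would then make the call, sending the key in a
# header, for example with urllib.request:
#   request = urllib.request.Request(url, headers={"Authorization": f"Bearer {API_KEY}"})
#   records = json.load(urllib.request.urlopen(request))["results"]
```

The same `build_query` call with no `category` would pull everything updated in the window, which is how you widen or narrow the pull to match your refresh frequency.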
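The multi-pass, child-data pattern described above (companies first, then their contacts, keyed by data source plus external ID so a re-run updates rather than duplicates) can be sketched like this. The two fetch functions and all field names are hypothetical stand-ins for real API calls:

```python
# Stand-ins for two real API calls (the field names are made up).
def fetch_companies():
    return [{"id": "c1", "name": "Acme", "primary_contact_id": "p9"},
            {"id": "c2", "name": "Globex", "primary_contact_id": "p7"}]

def fetch_contacts(contact_ids):
    details = {"p9": {"email": "pat@acme.example"},
               "p7": {"email": "lee@globex.example"}}
    return {cid: details[cid] for cid in contact_ids}

# Our "database", keyed by (data_source, external_id) so a subsequent
# run updates the existing record instead of creating another one.
db = {}

def upsert(source, external_id, fields):
    db.setdefault((source, external_id), {}).update(fields)

# Pass 1: pull and store the companies, caching the contact IDs we saw.
companies = fetch_companies()
for company in companies:
    upsert("exampleapi", company["id"],
           {"name": company["name"], "contact_id": company["primary_contact_id"]})
contact_ids = [c["primary_contact_id"] for c in companies]

# Pass 2: go back with the cached ID list and attach the related data.
contacts = fetch_contacts(contact_ids)
for company in companies:
    contact = contacts[company["primary_contact_id"]]
    upsert("exampleapi", company["id"], {"contact_email": contact["email"]})
```

In a real system `db` would be actual tables, but the `(data_source, external_id)` key is the part that lets you merge multiple sources without colliding IDs.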
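For the scraping side, rather than relying on a copied XPath with a brittle ID, you walk the repeating structure: here, every title link in a post loop. This sketch uses Python's standard-library HTML parser against a made-up page fragment; the markup is hypothetical, and on a real project you would fetch the page first (and likely reach for Selenium or a parsing library when a login or JavaScript is involved):

```python
from html.parser import HTMLParser

# A made-up page fragment; real markup would come from fetching the blog.
SAMPLE_PAGE = """
<div class="post-loop">
  <article><h2><a href="/scraping-101">Scraping 101</a></h2></article>
  <article><h2><a href="/api-basics">API Basics</a></h2></article>
</div>
"""

class PostTitleScraper(HTMLParser):
    """Collect the text and href of every anchor inside a post title."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.in_title_link = False
        self.current_href = None
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.in_title_link = True
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.in_title_link = False

    def handle_data(self, data):
        if self.in_title_link and data.strip():
            self.posts.append({"title": data.strip(), "link": self.current_href})

scraper = PostTitleScraper()
scraper.feed(SAMPLE_PAGE)
```

The collected `link` values are also what drives the link navigating mentioned above: each one is a page you would fetch next to pull the full post data.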
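The RSS-reader pattern mentioned earlier (hit several feeds, pull the articles into one place) is also just structured data pulling. A minimal sketch with the standard library, parsing an inline sample document instead of a feed fetched from a URL:

```python
import xml.etree.ElementTree as ET

# An inline sample standing in for a feed fetched from a URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>Post One</title><link>https://example.com/one</link></item>
  <item><title>Post Two</title><link>https://example.com/two</link></item>
</channel></rss>"""

def read_feed(xml_text):
    """Pull the title and link of every item in one RSS feed."""
    channel = ET.fromstring(xml_text).find("channel")
    return [{"title": item.findtext("title"), "link": item.findtext("link")}
            for item in channel.findall("item")]

# A reader app would loop over many feed URLs and merge the results
# so everything can be read in one place.
articles = read_feed(SAMPLE_FEED)
```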