Detailed Notes
This is part two of a short series on data scraping. We cover the topic at a high level and describe the forms these solutions can take, including calling an API, downloading reports, and even scraping data a field at a time while crawling a web page. We also include tips and tricks to help you create a high-quality scraping solution and properly set expectations.
This comes from the Develpreneur.com mentor series of presentations.
Transcript Text
So you're probably going to have to do some additional POSTs: find links, click on them, essentially post requests that get the data back. You may even need to trigger searches or filters, or build in delays. Let's use Best Buy as an example, since we've seen that one before. Say I want to look for PS5 consoles. What I need the application to do is trigger the search, walk through all of the result items, scrape each one, grab the data, and push it out into my database or data store.

That's scraping. As I said, it can be time-consuming, and it can be very fragile. If the site changes its layout or format, that is going to break the scraping piece, because you are assuming that things follow a certain structure, and it's not a published structure. It's sort of like an undocumented feature: you can go in and scrape a site, but as soon as they change something, you're going to have to update your scrapers.

Another approach that is not used as often, but falls more into the API category, is using RSS feeds. Here you're going to get really good structured data, much like an API. RSS feeds tend to give you fewer parameters, but at least they will give you some sort of feed of data: for a time frame, or maybe out to a limited number of results. They're really easy to call. For example, I can go out to the Develpreneur site and request its RSS feed. I don't want to do it here because it won't display right, so let me jump over to a browser.
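That walk-the-results flow can be sketched with Python's standard library alone. The `item-title` and `item-price` class names below are hypothetical; a real retail site uses its own markup, and that markup changes without notice, which is exactly why this kind of scraper is fragile.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Product:
    title: str
    price: str


class ResultParser(HTMLParser):
    """Collect text from elements matching an assumed result-item markup."""

    def __init__(self):
        super().__init__()
        self._field = None       # which field the parser is currently inside
        self._current = {}
        self.products = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "item-title" in classes:
            self._field = "title"
        elif "item-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "title" in self._current and "price" in self._current:
                self.products.append(Product(**self._current))
                self._current = {}


def scrape_results(html: str) -> list[Product]:
    """Walk one page of search results and return the scraped items."""
    parser = ResultParser()
    parser.feed(html)
    return parser.products
```

In a real run you would fetch the search-results page first (and honor delays between requests); the parsing step is the part that breaks when the site's structure changes.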
Over in Chrome, if I go to the Develpreneur feed (assuming I can spell it right), I get back something like this. It's XML, which makes it nice, and you can see it follows a format. For this RSS feed, I can see what the channel is and whether there's an image involved, which I can always store. Within each item, I can see the title, a link to it, who the creator is, the publishing date, category fields, the description, and comments. All of it is XML formatted. Notice that I passed no parameters, so this just brings back data for however far back the feed goes; in this case, since it's WordPress, the site controls how many articles back you can get. But the results are nicely formatted, and we have a lot of ways to parse our way through XML, pull that information out, and bring it into our database. So it's very similar to an API. APIs may return XML as well, although sometimes it may be easier to use JSON, since a lot of times you can get that format too. In general, though, RSS feeds are far easier to deal with than actual web page scraping, and therefore they can be a very powerful tool for your scraping efforts.

So let's talk about some tips, tricks, and trip-ups you may run into: some helpful hints if you're going to get into a scraping project. The first thing to do is start simple and validate your calls, whether that is an API call, scraping a site, or an RSS feed. With an API call, put as tight a parameter on the query as possible. If you can, narrow it down to a call by a single ID, so that you know you get a single record.
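Pulling those item fields out of a feed takes only a few lines with the standard library's ElementTree. This is a minimal sketch for RSS 2.0; the `dc:creator` element lives in the Dublin Core namespace that WordPress feeds use.

```python
import xml.etree.ElementTree as ET

# Namespace WordPress uses for the per-item author field.
DC = {"dc": "http://purl.org/dc/elements/1.1/"}


def parse_feed(xml_text: str) -> list[dict]:
    """Return one dict per <item> with the fields we care about."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "creator": item.findtext("dc:creator", namespaces=DC),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]
```

Because `findtext` returns None for a missing element, absent fields show up as gaps you can handle later rather than as parse failures.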
That single record is what you want to start with. In web scraping, the simplest case would be a single page: I want to go to that page and pull its data, or maybe pull that data, navigate to the detail page, and pull the data from there. Going through that process of figuring out exactly what you want to get, verifying that you can get it, building out the mappings, and storing the data while making sure it maintains its integrity is always going to be the first step. It is so much easier to do that with a small result set than by trying to spin through huge numbers of records. For one, it takes a while to go through those records, so your cycle time is going to be pretty big. But also, it's easy to get lost dealing with a whole bunch of different things and special cases. It's better to pick one record, get that working, and then expand out to future records. If you then see special cases, or places where fields are missing, you can address those on a case-by-case basis to support a wider range of results.

You're also going to want to look for hidden fields. On a website, those are fields that are literally of a hidden type, or things that are not otherwise obvious. In an API, they may even be undocumented: maybe you see the result sets the documentation describes, but when you actually pull the API results, you find additional fields. A lot of times these are going to be things like internal IDs that you'll need in order to properly link to another web page, or to make a subsequent call that gets you more detail about the data you're pulling.
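One low-tech way to hunt for those hidden fields on a page is to collect every hidden input before deciding which internal IDs matter for a follow-up request. A sketch with the standard library's HTMLParser (the `itemId` name in the test markup is made up):

```python
from html.parser import HTMLParser


class HiddenFieldFinder(HTMLParser):
    """Collect <input type="hidden"> name/value pairs from a page.

    These are often the internal IDs a site round-trips through forms,
    and the ones you need to build a subsequent detail request."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value")


def find_hidden_fields(html: str) -> dict:
    finder = HiddenFieldFinder()
    finder.feed(html)
    return finder.fields
```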
Since we usually work with multiple data sources, it is very helpful to use a pattern where you have intermediary classes or structures to store things in, and then those classes do the save and update to your data store, whatever that happens to be. The first time through, you have to build the whole thing: the scraper or parser, then the data structure, then the insert, update, and delete functions (the CRUD-related pieces) so that the data in those structures can be pushed into your database. But the bonus is that the next time around, if you did it right, all you have to do is build the parser and mapper into your general structure, and everything behind it already works. You know your inserts, updates, and deletes all work as long as the data is valid. So the next time through, you just do your parsing, and since you've probably already built a lot of the validations into those classes, you parse or scrape, put the data into the class, run your validation steps, and voila: you know whether you have a valid set of data.

Alongside that, either in those classes or as utility functions, you're going to find that formatting helpers are very valuable. Do not assume, particularly across multiple data sources, that things will always have the same format. Phone numbers and dates in particular are probably the most common things you'll run into. On one site a phone number may be ten straight digits; on another it may always have hyphens; some may be formatted with parentheses and such.
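Here is a minimal sketch of that intermediary-class pattern, using an in-memory SQLite table as a stand-in data store. The `JobPosting` shape and the source-specific field names (`jobTitle`, `town`) are hypothetical; the point is that adding a new source means writing only a new mapper, because validation and persistence already live in the shared class.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class JobPosting:
    """Intermediary structure every source-specific parser maps into.

    Only this class knows how to validate and persist itself, so the
    CRUD code is built once and reused for every data source."""
    source: str
    title: str
    city: str

    def validate(self) -> bool:
        return bool(self.title and self.city)

    def save(self, conn: sqlite3.Connection) -> None:
        conn.execute(
            "INSERT INTO jobs (source, title, city) VALUES (?, ?, ?)",
            (self.source, self.title, self.city),
        )


def parse_source_a(record: dict) -> JobPosting:
    # One mapper per data source; this is the only new code a source needs.
    return JobPosting(source="A", title=record["jobTitle"], city=record["town"])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (source TEXT, title TEXT, city TEXT)")

posting = parse_source_a({"jobTitle": "RN - ICU", "town": "Houston"})
if posting.validate():
    posting.save(conn)
```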
You're going to want something that takes those individual data items and does some sort of conversion to a standard format that you're going to store. Like I said, dates are huge for that. A lot of times dates will be all over the map in terms of the style and format they use, and many of them will be localized. It may not be an actual date-time field where you can just do your conversion easily; particularly with web-scraped data, there are all kinds of ways you may see a date displayed. So you want a helper that takes a string in, does some formatting, and spits it out in a consistent format: the one you're going to use for your data storage.

Like I said, date and time formats vary quite a bit, and money does as well. Currency differs from country to country: you'll have different currencies, different symbols, and things of that nature. You may even have a different scale of numbers, where a million units of one currency is almost unheard of in one economy, while everyday transactions run to a million of that currency in another. That again is where formatting helpers are going to be huge, because there are so many different ways a value can appear. You want to be able to cut through the formatting that may be on a display, get to the actual data items you need, and convert those into your system. Now, dates and times in particular can be a challenge when you're pulling from different data sources. If dates and times need to match up with data from other sources, you may run into situations where you need to know what time zone the data is in.
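Formatting helpers along those lines might look like this. The accepted phone and date formats below are only the ones named above; a real helper grows one format at a time as new sources appear.

```python
import re
from datetime import datetime


def normalize_phone(raw: str) -> str:
    """Strip every display format down to ten digits; "" if unusable."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading US country code
    return digits if len(digits) == 10 else ""


def normalize_date(raw: str) -> str:
    """Try each format we have actually seen; output is always ISO 8601."""
    for fmt in ("%m/%d/%Y", "%d %B %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Every scraped value passes through one of these on its way into storage, so the database only ever sees the canonical form.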
For dates, times, money, and currencies, you're probably going to want to convert to a universal time zone or a universal monetary unit, and then be able to convert back out if you need to. For money, maybe you hit sites that use euros, rubles, and a whole bunch of other currencies, but you decide to convert everything to US dollars, store that in your database, and then have your features convert those monetary values to whatever currency the user wants when you display them. There is a lot of work that can go into those conversions, but once you have them they can be pretty valuable. The nice thing about a lot of these formatting helpers, and even some of those structures, is that they're portable from project to project. A lot of the validators in particular (phone numbers, addresses, things like ZIP codes that can vary widely) are going to show up app after app, so once you've built them, that's nice code to carry with you to your next project.

Next up: sizing and outliers. This is probably the biggest scraping challenge a lot of the time, along with figuring out how to take multiple data sources and combine them into a single data source, your store. You may have values that are typically a certain length. Think of a name: if you have a last name, maybe 20 characters is typically going to be the limit, or maybe 30.
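The convert-to-universal, convert-back-on-display idea can be sketched like this. Storing UTC uses the standard library's zoneinfo; the exchange rates here are made-up placeholders, since a real system would pull them from a rate source.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc(local_string: str, source_tz: str) -> datetime:
    """Interpret a source's naive timestamp in its own zone, store UTC."""
    naive = datetime.strptime(local_string, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(source_tz)).astimezone(timezone.utc)


# Hypothetical rates; a real system would fetch current ones.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "RUB": 0.011}


def to_usd(amount: float, currency: str) -> float:
    """Normalize every scraped amount to the one currency we store."""
    return round(amount * RATES_TO_USD[currency], 2)
```

Displaying back out is the mirror image: divide by the target currency's rate, or convert the stored UTC timestamp into the viewer's zone.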
But then you may have somebody with a really convoluted hyphenated last name, something like Jones-Brown-Smith-McGillicuddy, just one big long thing. You need to figure out how to handle that. Do we need some incredibly long data size to be able to store anything we think might show up? Or do we need some way to encode it so it gets cut down to a smaller value, or maybe truncated, or something along those lines?

Then you have to look at outliers as well. You're going to find situations where 999 times out of 1,000 the data looks one way, but every so often it looks a little different. That may be due to special characters, which we'll look at a bit as well, or, particularly with web scraping, the data may almost always be on one line except for the times it wraps to a second line. Those kinds of outliers are things you're going to have to keep an eye out for when you're scraping.

You also have to look at what happens if there's a null or an empty value, because sometimes data just doesn't get entered, and it can be inconsistent. You may have an address that's typically address line, city, state, ZIP, and you'll find records with an address line and city but no state, or city and state but no ZIP, or a state but no city, or maybe no address line and just a ZIP. You've got to be aware that there are going to be gaps in the data, and decide how you want to handle them, because sometimes that data is critical for you to be able to utilize the record. For example, let's go back to that job data. Let's say part of what you do is give a list of
healthcare-related jobs by city. If you pull data that doesn't tell you what city a job is in, that record is almost useless to you, because you don't know which city to file it under. But in the same situation, what happens when a job is listed in four different cities? You probably want it replicated to four records, one for each city. But then what happens if somebody searches two or three of those cities? Do you want them to see that same job show up twice or more?
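That multi-city question can be handled by replicating on write and de-duplicating on read. A sketch (the record fields here are hypothetical):

```python
def explode_by_city(job: dict) -> list[dict]:
    """Replicate on write: one stored record per city the job lists."""
    return [{**job, "city": city} for city in job["cities"]]


def search(records: list[dict], cities: list[str]) -> list[dict]:
    """De-duplicate on read: a multi-city search shows each job once."""
    seen, results = set(), []
    for record in records:
        if record["city"] in cities and record["id"] not in seen:
            seen.add(record["id"])
            results.append(record)
    return results
```

Replicating keeps the by-city lookup simple; the job's own ID is what lets the search side collapse the copies back into one result.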