📺 Develpreneur YouTube Episode

Video + transcript

Navigating Data Integration: Scraping Vs. APIs

2024-04-25 • YouTube

Detailed Notes

In the latest Develpreneur podcast episode, hosts Rob and Michael explore data integration methods, focusing on scraping versus using APIs. Drawing on experience in both realms, they dissect the challenges and advantages of each approach and offer valuable insights for developers and data enthusiasts.

Using Scraping for Data Integration

What is scraping?

Scraping involves programmatically extracting data from web pages, mimicking human interaction with the user interface. Today, web scraping involves navigating HTML structures, identifying elements by their IDs, and extracting relevant information.
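The idea of pulling a value out of a page by its element ID can be sketched with nothing but the Python standard library. The HTML snippet and the `phone` ID below are made-up examples; real pages are messier, which is exactly the challenge the hosts describe.

```python
# A minimal sketch of scraping by element ID, using only the standard
# library. The HTML snippet and the "phone"/"name" IDs are hypothetical.
from html.parser import HTMLParser

class IdExtractor(HTMLParser):
    """Collects the text content of every element that has an id attribute."""
    def __init__(self):
        super().__init__()
        self._current_id = None
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # Remember the id of the element we just entered (None if it has no id).
        self._current_id = dict(attrs).get("id")

    def handle_data(self, data):
        if self._current_id and data.strip():
            self.values[self._current_id] = data.strip()

    def handle_endtag(self, tag):
        self._current_id = None

html_page = """
<div id="name">Ada Lovelace</div>
<div id="phone">555-0100</div>
"""

parser = IdExtractor()
parser.feed(html_page)
print(parser.values["phone"])  # -> 555-0100
```

When the page author has given every interesting element a stable ID, extraction really is this direct; without IDs, you are back to walking the DOM structurally, which breaks as soon as the layout changes.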

Inconsistent IDs and Embedded Content

Scraping challenges arise when pages lack consistent IDs or contain embedded content like iframes. On the other hand, APIs provide a structured means of accessing data, offering clear endpoints and formatted responses, typically in JSON or XML.

Streamlining Scraping with Selenium IDE

Rob underscores the importance of developers incorporating IDs into web page elements for easier scraping. He recommends Scrapy and Selenium IDE, useful tools for recording interactions that provide valuable insight into a page's scrapeability.

Using APIs for Data Integration

What are APIs?

An API is a set of rules for software communication. It defines methods and data formats for requesting and exchanging information. APIs enable seamless integration between systems. They provide structured data access, clear endpoints, and formatted responses. Unlike scraping, APIs follow contractual agreements. This simplifies data retrieval and ensures consistency.
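The contrast with scraping is that a structured API response needs no guesswork about where the data lives. A minimal sketch, using a hypothetical JSON response body and field names (a real API documents its own schema):

```python
# A sketch of consuming a structured API response. The response body,
# endpoint, and field names are hypothetical examples.
import json

# What an API might return for GET /customers/1 -- the contract
# defines exactly which fields come back and in what format.
response_body = '{"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"}'

customer = json.loads(response_body)
print(customer["email"])  # -> ada@example.com
```

Compare this one-line lookup with the ID-hunting and DOM-walking a scraper has to do for the same value.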

Controlled Access and Security

Michael highlights the advantages of APIs, emphasizing their controlled access and security features. Unlike scraping, which can be hindered by page changes and inconsistencies, APIs offer a reliable and secure way to access data, with built-in authentication and authorization mechanisms.

Simplifying Data Retrieval

API contracts define the expected behavior and data format for interacting with an API, making it easier for developers to integrate and consume data. By adhering to these contracts, developers can streamline the data retrieval process and avoid potential errors.

Understanding Endpoints and Parameters

Rob and Michael stress the importance of thoroughly understanding API documentation, which outlines endpoints, request parameters, authentication methods, and response formats. Clear documentation enables developers to effectively use APIs and integrate data into their applications.
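What the documentation spells out as endpoints and request parameters translates directly into the URL you assemble. A small sketch with a hypothetical base URL and parameter names:

```python
# A sketch of assembling an API request from a documented endpoint and
# query parameters. The base URL and parameter names are hypothetical.
from urllib.parse import urlencode

base = "https://api.example.com/v1/customers"
params = {"status": "active", "limit": 25}

url = f"{base}?{urlencode(params)}"
print(url)  # -> https://api.example.com/v1/customers?status=active&limit=25
```

Authentication details (API keys, tokens) would typically be added as headers or parameters, exactly as the provider's documentation prescribes.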

Exploring Alternative Data Sources

The Significance of RSS Feeds

An RSS feed publishes frequently updated content using the Really Simple Syndication format. Blog posts, news, and podcasts are commonly published via RSS. Users subscribe to a website's RSS feed, and new entries are aggregated into a single feed that feed readers and browsers can access.

RSS Feeds Contain a Lot of Relevant Information

RSS feeds offer easily parsed XML documents, simplifying data extraction compared to scraping or API integration. These feeds include metadata, content summaries, and links, enabling users to stay updated on preferred websites effortlessly.
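Because RSS is well-formed XML, the standard library is enough to pull out titles, links, and summaries. The feed content below is a made-up example, but real feeds follow the same `channel`/`item` structure:

```python
# A sketch of parsing an RSS feed with the standard library.
# The feed content here is a hypothetical example.
import xml.etree.ElementTree as ET

rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First Post</title>
      <link>https://example.com/first-post</link>
      <description>A short summary.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iter("item")
]
print(items[0]["title"])  # -> First Post
```

For a real feed you would fetch the XML over HTTP first; the parsing step stays the same, which is why RSS tends to be far more stable than scraping the rendered page.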

In conclusion, Rob and Michael recommend exploring scraping, APIs, and RSS feeds. Consider tools like Scrapy and Selenium for scraping, and familiarize yourself with various APIs for data retrieval. These tips provide a solid grounding in scraping, APIs, and RSS feeds so developers can navigate data integration confidently and efficiently.

Feedback and questions are welcome at [email protected], and listeners are invited to connect with Develpreneur on YouTube for more insights and discussions. By focusing on mastering data integration, developers can unlock new possibilities and streamline their workflows.

Additional Resources

* Restful API Testing Using RestAssured - https://develpreneur.com/restful-api-testing-using-restassured/
* RSS Reader And Promoter – A Product Walk Through - https://develpreneur.com/rss-reader-and-promoter-a-product-walk-through/
* Scrapy - https://scrapy.org/
* Selenium IDE - https://www.selenium.dev/selenium-ide/

Transcript Text
[Music]
similar thing um we've talked about this
I think we did this one time at a mentor
class way back I think there was a
presentation that was and it's basically
because I had a conversation today with
somebody that brought it up again I was
thinking this may be a good good topic
to cover is talking about in
particularly scraping versus API now
this isn't as much of a probably not as
much a technical thing although I guess
it is some extent because I I I think
too often that particularly when you're
getting started you'll you know when
you're not when you haven't done a lot
as a
developer you'll hear you know somebody
will say hey we need to scrape this data
and then you just like okay well it's on
a website we got to go to that website
we've got to find a way to crawl it and
pull that because because that's a cool
fun challenging thing to do or because I
just did that in my you know my software
Class A couple years ago so I sort of
know how to do it and it is one of those
things like once you know how to do it
it's like cool I can do it it's slow
it's tedious it's painful depending on
how well they use IDs and whether you
have to use CSS selectors or XPath and
all that kind of crud but it's also so
fragile and even with um there's some AI
projects that are working on that to try
to make that less fragile but even with
that apis are so much better and there's
just like even the idea of like
importing and exporting data like just
file uploads and you know that kind of
stuff that's they are ways to integrate
and to move data from system to system
that I think people don't think about as
much and so that's what I want to talk a
bit about it's it's really going to be
more like we talk about apis and some of
the the the pros and cons of those
versus like just go through a scraping
thing and stuff like that um I see is
that sound of like seem like something
You' got some you have some thoughts on
as well yeah
because incidentally that's also the
problem with the web automation with the
web testing because essentially having
to scrape to find the ID is to in uh
basically to interact with the using the
web driver with the pages and one of the
arguments I have not necessarily about
screen scraping but about website design
in general is good website development
requires IDs on your any input fields or
any interactive action fields that you
have on a page if you're not doing that
you're essentially writing bad code and
I don't care how much AI you can put out
there you're never going to fix the
problem you have to write the code
correctly or it's just not going to work
yeah so I think that's and that's
exactly the place one of the things we
want to talk about is because that's
that's part of why scraping is such a
pain in the butt because most code is
not written that way sometimes
intentionally but I think we can talk a
little bit we can dig a little bit into
the
whole we get a little nerdy on that
about getting into a control and how do
you how do you navigate to that and pull
that information back out
so hello and welcome back we are here
for another episode of building better
developers also known as the developer
or podcast actually I think it was it
was developer or first and then it
became building better developers but it
doesn't that's like neither here nor
there I'm Rob one of the founders of
develop and or and across the uh the
digiverse or whatever it is is Michael
welcome and I want you introduce
yourself as well
hey everyone my name is Michael Meloche
co-founder of Develpreneur and founder of
EnvisionQA
today this episode we are going to
talk about we really it's at a high
level we're going to talk about
integration but we're going to take a
step down and talk about like really the
differences between scraping versus
using apis or other methods there are
many ways that you may need to or want
to ingest data into a system and I want
to talk a little bit about those because
there is I too often they sort of just
there's like a broad brush approach and
either everybody thinks that you're just
at the most simple of like all we can do
is import you know we have to have a CSV
and import that and that's the only way
we get data in or the other end of the
only way we can do this is have this
really complex web scraper that goes out
and crawls all these sites and grabs all
this information and then of course as
soon as any that information changes on
the site it's broken and we have to
rewrite the code
neither of those are 100% correct
although they might be for you it
depends on where you're
at so I think what I want to start with
is the let's start on the The Far Side
of that is the challenges of scraping
and in particular uh this is something
that's although it is near and dear to
my heart because I've had to do this
with several projects it's maybe even
nearer and dearer or however it is to
Michael's because this is something that he
does as part of his
code generation tool and so I think
we'll start with that is what are the
let's start with what let's talk a
little bit what makes a good page for
scraping for getting information back
off of that page from a program
programmatic scent not we're not talking
user experience stuff but we're talking
the backend
side so I guess one additional thing
maybe you can touch on too is you know
why what is web scraping why why do we
use it before I get into that because I
want to make sure listeners understand
the difference between web scraping and
interacting with
apis all right I will take that volley
back and I'll work with that one so
scraping I have heard people that have
actually I've come across people that
think all of these things are the same
so for our purposes we're going to talk
about scraping is and I think it's
probably the technically correct way
that you look at it scraping is when you
go to either a a it's basically go to a
user interface and you from a program
are trying to do the same thing a human
would do it actually comes all the way
it goes back to the old um mainframes
and they would screen scrape what they
would do is they would have something
that would pull the display basically
back and it would go you know count out
15 columns over here and three rows down
and grab that value and then go six
columns over here and two rows up from
that and then grab that value so it's
literally like looking at your screen in
that case the grid that it was because
it was I mean it's usually like a 40 by 25
grid and it was like where do you go on
it to get this specific value and then
sometimes it would be you know you grab
three blocks in a row to get a value
that's a string that you know is you
know three characters in length that has
advanced into the web world of go to any
page the easiest way to like to to get a
feel for it is to go to any web page
with like pretty much any browser these
days rightclick and then hit inspect and
you're going to end up getting somewhere
there you're going to get an option to
like pull out a JavaScript or a view
Source kind of thing and if you this is
just if you don't know you know if you
don't know HTML on that what you're
going to see is you're going to see what
is a essentially a formatted document
there's all of these little tags and
there's these ways that you build out
that page behind the scenes so if think
of that page is all these little
controls or these little widgets however
you best organize them the goal with
scraping is like for example if I'm
going to a an input form that has first
name last name email phone number and I
want to go grab the phone number then I
can go look at that document format and
I know that if I go to the you know the
email input field I can grab that that
value off of it or vice versa I can put
a value into that
field and how you get to
that is in itself a little bit of a
journey because there are multiple ways
to do that there are multiple ways to
tackle the the format and how we
navigate our way into those pages but
that's the scraping side API side is and
I I I figured I had like an example
today that I think probably works the
best is that if you want to go look up
information about let's say Michael you
can go to his website you can go find
his about page and then you can go look
for stuff that says his name his email
phone number all that kind of stuff and
you can go find that and then you
personally would have to go look at it
find it on the screen and then write it
down somewhere or if you know him enough
you could just text him something and
say hey I need your your phone number
your email address and he will give you
back in a way that you can easily read
quickly you know his email and his phone
same thing people do that all the times
with email you send somebody says hey I
need your address they'll give you a
nice formatted little ADD address thing
that's sort of what an a an API is is
that instead of you having to go look
through all this stuff you say hey this
is the stuff I want this is a data I
want and the API gives that back to you
in a really nice format usually it's
either depending on how you do it it's
usually going to be Json or XML or some
different things like that depending on
what you're doing but the the bonus is
it's well formatted and it is it's not
guesswork if you're scraping you're
trying to figure out is this B data even
available in an API it is a there is a
contract there of I'm going to give you
this information and you're going to
give me back these fields and the data
that match those now I think I got a
little long winded so sorry about that
so I'll throw it back to you Michael and
let's see if see if you can do and it
say anything in a shorter method than I
just did no that was really good and
thank you for doing that because I I
just wanted to make sure that we were
kind of on the Baseline one other thing
to kind of mentioned here because if
you're dealing with apis like Rob said
you're kind of dealing with the contract
you're dealing with a controlled way of
passing data from your system to a a
requestor a user someone trying to get
information from you screen scraping is
essentially going out to any site and
just basically trying to take the data
that is being displayed on a web
page and in many cases
not many companies want you to go do
that uh because they want their data to
be protected where which is where apis
are a little more powerful they're a
little more controlled and you can
actually put um restrictions and
security on top of them to only allow
certain people access to your data
whereas screen scraping if it's on the
web you can get it uh which is where a
lot of your AI tools today are out there
just kind of reading a whole bunch of
data from so many different pages
they're essentially scraping those pages
from a developer's perspective if I'm
building a site for screen scraping the
site needs to follow basic web page or
Web building techniques every element on
your page that contains data that is
necessary for consumption should have an
ID maybe a name depending upon which uh
you know if you're using uh Json or
whatever your tool is but essentially ID
is the universal ID it has to be unique
for all your elements on your page you
should never have two duplicate
IDs so if your developers are following
best practices and building pages with
IDs then you should have no problem you
should be able to just say hey go read
the Dom read pull down basically or
download the web page run a script to
find all the IDS on the page boom there
you go you can go grab all the data off
the page real simple similar to an API
but an API you actually make a request
you get back Json XML which is very easy
to parse um but like I said if you do
the API and it returns XML if you're
doing a screen scrape and your page has
good IDs you essentially could do the
same thing it's essentially you're
reading an XML you're reading a Dom
document and the two could almost be
identical if it's done correctly one of
the biggest problems we have especially
from a testing perspective is not a lot
of developers are building web pages
with IDs in mind they're not thinking
about testers they're not thinking about
web scrapers they're just trying to get
the code out the door as quickly as possible
sometimes using tools with drag and
drop which those are great but those
typically will generate a random ID for
the page now every time the page gets
rebuilt all those IDs change they all
randomly change so if you're doing that
well yeah your page has IDs but now oh
next time I try to run a scraper it all
fails because none of the IDS are the
same so these are some of the problems
and restrictions we have to look at when
building Pages we have to think how is
this data going to be consumed is it
just someone interacting with the web
page is it do I need to make sure this
is tested do I have a more complex
interface where people have to go in and
actually use this like um like a patient
portal or you know an inventory
management system those things should
have IDs because one if you're consuming
data if you're sending data back to a
system to be saved you need to have some
identifier some unique identifier and
that's how we send information back and
forth so we know that oh hey I got this
field this field is this data stored in
this location if you don't do that how
do you know what data you're
saving and that's I think that gets us
into the one of the big differences
between the two when you think about it
is if you from an API you can say hey I
want let's say I want a customer data I
can go in I can ask I can say here's a
customer ID give me the information back
and it'll say oh for customer ID one
here's their name here's their address
blah blah blah if you're doing the
scrape particularly if you if you've got
these random IDs and some of these
things or no IDs then what you're doing
is you're going to say okay I need to
First find a way to get to the search
page that and then figure out how do I
search for that customer and then when I
get to that customer then I have to go
find somewhere in the document where is
the name field where is the address
field and if I have IDs then it's really
just a matter of you just go find the
control with that ID and you go oh the
name the name field is ID 46 whatever it
is and here's a value and you can pull
it
out if you don't have that you end up
having to walk the Dom basically so you
say well if I go to the customer search
result page and I can go to the
customer detail then I know that if I
get three divs down then I'm into the
customer information and then within
that one there's two divs over and now
I'm into the address information and
within that I can go into there's a you
know maybe there's an href in there so I
can grab a link and then I can do that
and so you're having to like you're
having to walk the Dom just like you
know think about indexing anything you
either can walk your way all the way
through like if you want to search for a
record in database you can look at each
record go is it is it is it is it or if
you have an ID you can jump right to
that row and that's what IDs will give
you and so if you want something that is
crawlable that is scrapable I guess
they're always going to be crawlable
which let me talk about that for a
second crawling is really just baby step
scraping instead of getting into the
details and trying to pull data out all
you're doing is going into that document
and try to find anything in there that
would link you to to another page and
then go follow that link now you may say
well that's a piece of cake it's always
an href no it is not it could be an href
it could be JavaScript it could be a
whole bunch of other things it could be
and it could be behind the scenes it
could be something where it's actually a
call back to the server and the server
sends something back so there's a lot of
different ways that that that can be
handled which is why sometimes that's
done on purpose so that you instead
you're driven to the API side and then
they say hey we're going to charge you x
amount of money so you can subscribe to
the API and then you can get going if
you haven't looked at these I would or
if this is new to you I would go look at
Scrapy S C R A P Y and I think it's
scrapy.org I think there's a site with it
as well uh that they have and it may be
I forget what the other one they called
but essentially it's just like so you
can go do some scraping it'll help you
sort of drag and drop but there's also
some python behind it so that you can
actually go see what it looks like the
other thing I would do is pick just
about anything go search for API for
whatever your favorite tool is you know
whether you're uh I don't know if like
you use HubSpot or if you use Google go
look at the Google apis or Amazon's apis
there's pretty much everybody out there
there are apis some are free and some
are going to be the much better way to
deal with it and then some are not
because they may still be a better way
to deal with it because you don't have
to deal with all the headaches of that
we've just talked about but then you
also have to pay them for it because
they're going to say hey we're going to
make it easier for you we're going to
make it nice for you but hey you
know share the love and so you're going
to pay us whatever you need to pay
thoughts on
those so those are really good examples
now there's one Edge case neither of us
have touched on yet and that is
embedded code or like iframes or
essentially a page within a page if
you're trying to do scraping that won't
work if you're trying to stream the page
and read the data through a stream the
only way I've gotten around this is to
actually physically uh download the
complete contents of the page through
like Firefox or Chrome with a plugin uh
but only certain browsers even give you
the functionality to get all that embed
code with your source code of the page
when you try to export it so that's
another limitation with scraping is if
you're trying to scrape pages that are
actually loading data from other sites
through an like an embedded uh plugin
you might not be able to get it you
might come into it and it's like oh it's
not there you can't find it uh and
that's just something else to think
about as you're going through this uh
one other thing to think about too
is if you're doing from from the testing
side or even if you're doing it from a
scraping side uh look at selenium IDE
it's a really cool free tool it's got
plugins for just about every browser
it's got a good desktop uh little
application and What it lets you do is
it basically you open up your browser
open up selenium and you literally click
through the page and as you click it
records what you're doing and you can
see if the elements you're working on on
your page have IDs how is selenium
seeing it is essentially going to tell
you how hard is this project going to be
to go scrape this page and a bonus to
that is Selenium will let you
export that in various languages so if
you want to scrape if you want to have
like the just a Brute Force scrape of
data you can actually use selenium have
it walk through all the steps generate
that code out and it will actually do it
for you you can just go run that in
whatever your language is it it's got
several that it does so let's say you
know PHP or python or whatever and it
will you can watch it open the browser
up walk through all of that stuff do
what it needs to do and then you know
close the browser hopefully close the
browser out otherwise hit close browser
afterwards um but selenium is awesome
selenium really is if you're if you're
curious if you're wondering if this is
something you can do then selenium is
one that say that is probably the the
fastest way to be able to get a page
open it get data off of it and and be
able to do that in a repetitive fashion
now there are some you know there's some
gotchas to it and things like that but
that really is like a excellent starting
point for you and that's where you're
going to see it all over the place
anywhere that's anybody that's dealing
with scraping in particular they're
almost always going to mention Python
and they're almost always going to
mention selenium and selenium more so
more so than anything else because that
really is sort of the the industry
standard I think for for like robotic
behavior on a web page final
thoughts yeah all I would like to say is
for those of you getting into scraping
if you haven't really done scraping
before um like Rob said check out both
the Selenium and Scrapy tools um
and play with them uh there's heck go
to Develpreneur you know throw up uh
Selenium IDE or Scrapy try to scrape our
page some oh one thing we didn't touch
on though is the other thing is some
pages even offer RSS feeds which
essentially are another way to scrape a
page that we haven't really touched on
but it is yet another source it's
essentially another feeder that you can
get like an API but it's public so you
essentially hit a page here's all their
data that you can digest and it's in a
clean manner that's funny because as
soon as you mention Develpreneur I'm like oh we
forgot to mention RSS feeds RSS feeds
just for those you guys that anybody
that hasn't you probably have heard of
RSS readers and such but you can go see
the source for an RSS feed and they tend
to be very well they're they're
formatted documents they are it's very
easy to crawl through them and get the
data you need there are I've worked with
many of those over the years um and used
those actually as opposed to a scraper
or even as opposed to an API because
then for example we talked about I think
it was like a couple episodes back now
we talked about
the uh and you guys have seen some of
that if you if you've looked on the
python side or actually I'm sorry it's
the spring boot side where I built the
um I think I talked about there anyways
built the Upwork integrator
little app that goes out and grabs stuff
off of upwork well I do that from an RSS
feed because I can then take my specific
search and it allows you to crank that
out as an RSS feed and so then I can
just hit that RSS feed pull stuff down
and it's limited it's not going to give
me everything it gives me I think 30 40
50 records at a time at something but I
can hit that I can go through everything
it's it's XML so it's very easy to parse
very easy to pull out the pieces of
information I want and then I don't have
to deal with an API I don't have to
connect with that I don't have to deal
with scraping and it's pretty darn solid
I mean RSS feeds you tend to be pretty
stable and they're just going to be out
there so if you wanted to for example
have like be scraping the articles that
we have put out on the Develpreneur
site you can go to and basically any
WordPress like go to the site RSS or
rss2 depending on how it is and then you
can see a nice XML document that you can
parse and do what you you know do
whatever you want to with it other than
obviously repurpose it unless you tell
us and then then we're cool about that
that being said I think we're going to
wrap this one up so as always shoot us
email at info@develpreneur.com if you
have any questions if you have
suggestions if you have any anything
like that any feedback we'd love to hear
it you can check us out on YouTube at
Develpreneur and just go out there and
have yourself a great day a great week
and we will talk to you next
time all right thoughts that seem to go
pretty well too I thought yeah no I like
that in fact it gave me an idea and I'm
going to throw it out here so when I'm
listening to this again I'll remember to
do it uh I'm going to go ahead and throw
the links out there for uh
Scrapy um kind of like RS
there there's a plugin for that uh one
for Selenium IDE and then what I may even do
is I may even throw out a simple python
script to just uh screen scrape a page
yeah yeah if you need that I've got like
a thousand of those so so we do that we
really we we do that so often it's it's
it's sort of funny as it's now it's
become just a standard well that's
that's what the guy I was talking to
today it's a new customer and he's like
hey can we you know he wants some data
and it's like can we do that like yeah
people do that all the time literally
like we' have done that for all kinds of
different verticals and yeah it's just
we'll go grab it and yeah it's not if
you're scraping it that's the only thing
is it's not 100% it's that you'll get
some weird stuff even we didn't even get
into this we had something where we were
it was you pull down a PDF and you would
think pull down a PDF like XML or CSV or
anything else like just pull it down
scrape it and you're good PDFs even it
was was amazing it was obviously the
same uh generation engine it was a
report it was a nice little PDF report
that we were trying to find the data on
and it was it would periodically it
would just be totally off it was just we
could we finally got I think we got it
to like better than 98% hit rate but it
would be stuff where it was just
like it's like NOP it's they added
another column there and it's like there
shouldn't be but they just add another
column there and it would be because of
uh for example if somebody put Extra
Spaces somewhere it would just decide
that nope that's now another field
that's another column and you really
couldn't see it in the like if you
looked at the PDF it looked fine you had
to actually look at the the document
behind the scenes you had to look at the
Dom for that PDF to realize that oh they
actually you know threw another column
in there that's empty that has no value
but that's kind of crap that we you know
that you can run into yet another play
way that you can get into scraping and
integrating data and stuff like that
we've we've done it all it feels like
the last few years it's I that one what
uh over a decade ago uh back
when almost American old patient days
but uh there was a guy there was a site
out there for a while that had all the
old uh tabletop documents for like old
school Dungeons and Dragons BattleTech
um Mech Warrior like all the old stuff
that you can't get in anymore that's all
discontinued and he had it all online
all basically it was kind of like a
um Dropbox but this was kind of before
Dropbox was really popular and I just
wor a little scraper that went out
literally and it found all the links and
it downloaded all the documents in the
same file structure and I was able to uh
kind of parse through it and find the
ones I was looking for but I I couldn't
find them through the site so I was like
well here I'll just get them all and
I'll figure it out later and uh yeah
those are things you can do with these
scrapers it's not just scraping data but
you can actually download files you can
interact with things you can pull in
videos you can pull in images uh you
know lots of things you can do with it I
I actually did that, and this is bonus material, so there you go. I had a customer that we built, well, I built out a site for her, and we had, I don't know, 20 or 30 different pages and screens. At the end of the day she wanted a screenshot of every single screen as part of the user documentation, and I was like, oh my gosh, that's going to take a while. But what I ended up doing was a screen scraper that just walked through everything. It was easy enough because I knew all the links. I just crawled the site and said, go to each page, snap a picture, go to the next page, snap a picture. I think there were like three other pages where I went in manually, added those couple of links, and said go there and snap it. Suddenly I had, whatever it was, the 40 or 50 images she needed. Simple things like that. It's not until you think about it that you go, hey, that's actually a better way to do it than going through it manually. Once you've done it a couple of times, you look at some of these tasks and think, I really don't want to do that a hundred times, but maybe I can spend a few minutes, especially with some of these tools, and generate something I can replicate over and over again and have it do the work for me instead of going through that drudgery.
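That screenshot-every-page trick can be sketched in a few lines. This is a hedged sketch, not Rob's actual code: `slugify` is a pure helper for naming the files, while `snapshot_site` assumes the third-party `selenium` package and a matching Chrome driver are installed, which is why its import is deferred inside the function.

```python
import re

def slugify(url):
    """Turn a page URL into a safe screenshot filename."""
    name = re.sub(r"[^a-zA-Z0-9]+", "-", url).strip("-")
    return name + ".png"

def snapshot_site(page_urls, out_dir="."):
    """Visit each known page URL with Selenium and save a screenshot of it.

    Assumes the `selenium` package and a Chrome driver are available.
    """
    from selenium import webdriver  # deferred so slugify stays importable without selenium
    driver = webdriver.Chrome()
    try:
        for url in page_urls:
            driver.get(url)
            driver.save_screenshot(f"{out_dir}/{slugify(url)}")
    finally:
        driver.quit()
```

Given the list of links from an initial crawl, `snapshot_site(links, "docs/screens")` would produce one image per page for the user documentation.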
It's funny you mention that, because on the testing side of things, what you can do, especially with the Selenium WebDriver, is walk through a sitemap and take screenshots through the WebDriver. If you want a screenshot in Chrome or Firefox, you essentially load the driver for that browser and then take a snapshot of the page you're on. So if you're doing browser comparisons, to see how your site looks in each browser, you can do that and then run an image compare, and it will tell you, hey, this image does not match these others. It's kind of cool and very simple, and there are good examples out there for it. That's just another way you can do this with some of these online tools. That's a neat one.
We did one where, I'll say, what this customer was doing was a little bit shady. It's legal, but it's borderline stuff, and this was one of the things they did. They had a scraper, an automation tool, and it was very sensitive to certain themes and things like that. So they had a crawler go through, walk the site, and take a picture of each of the themes. Then later they could ask, am I on this page, and compare that whole image, as opposed to trying to look for other markers. They used to look for certain tags, but those would change: things would move, the formats would change. Instead, if the page looked basically the same, enough that they'd say, okay, this is an 80% match, then we're basically on the right page; if it's not, we're on the wrong page. They even did that with buttons. Instead of trying to find the button control, they would check whether that button's image existed somewhere in the mapping, use that to figure out the coordinates, and press the button that way. At first I was like, why do you have 18 pictures of the same button, all just a little bit different? It turned out it had to do with the way they were spinning stuff up: the display resolution, and also some of the default color themes that would show up. It's funny, the kind of stuff you don't think about, like why would you need to take pictures as you're walking through a site, but there are a lot of uses for such things.
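The "80% match" check described here can be approximated with a plain pixel-agreement ratio. This is a toy sketch that assumes both screenshots have already been decoded into equal-length pixel sequences; a real pipeline would use an image library and likely a fuzzier metric than exact equality.

```python
def match_ratio(pixels_a, pixels_b):
    """Fraction of positions where two equal-length pixel sequences agree."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("screenshots must be the same size")
    matches = sum(1 for a, b in zip(pixels_a, pixels_b) if a == b)
    return matches / len(pixels_a)

def same_page(pixels_a, pixels_b, threshold=0.8):
    """Treat two screenshots as the same page when enough pixels match."""
    return match_ratio(pixels_a, pixels_b) >= threshold

# Toy example with 1-bit "pixels": one position differs, so it's a 90% match.
reference = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
candidate = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]
```

With a threshold of 0.8, `same_page(reference, candidate)` passes, which mirrors the "if this is an 80% match, we're basically on the right page" rule.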
Well, it's even interesting now, too. The last thing you mentioned, they were taking the pictures to see what was there. Now Amazon has, it's not Polly, but they've got one of their libraries where you can basically look for text in an image and then scrape the text from the image as well. So not only can you scrape web pages, you can also scrape the images on a page. That's true.
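The hosts don't name the Amazon service, but Amazon Rekognition's DetectText API is one that does exactly this. The sketch below is an assumption-laden illustration: `detect_image_text` requires the `boto3` package and configured AWS credentials, while `lines_from_response` is a pure helper you can exercise without an AWS account.

```python
def lines_from_response(response):
    """Pull just the full-line detections out of a Rekognition DetectText response."""
    return [d["DetectedText"]
            for d in response.get("TextDetections", [])
            if d.get("Type") == "LINE"]

def detect_image_text(image_bytes):
    """Send image bytes to Amazon Rekognition and return the detected text lines.

    Assumes `boto3` is installed and AWS credentials are configured.
    """
    import boto3  # deferred so the helper above works without boto3 installed
    client = boto3.client("rekognition")
    response = client.detect_text(Image={"Bytes": image_bytes})
    return lines_from_response(response)
```

So a scraper that pulls images off a page could hand each one to `detect_image_text` and treat the returned lines like any other scraped text.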
That's one I've seen a bit of in recent years. I don't know how well it works, but I've seen a couple of places that use it, and it seems to work pretty well for them. Yeah, same family of services; Polly, I think, is the text-to-speech one, something like that. I was amazed at some of the image processing stuff. This was, gosh, three or four years ago, when I was going through all the Amazon services. They had just opened a couple of those up, and it was really built initially for augmented and virtual reality stuff. You'd have a series of pictures, and you could say, show me the parrots, and it would go find the parrots in each of those. Not by giving it a picture of a parrot; you'd just say, show me parrots, and it would pull those out. It's amazing, the image processing they can do and how well they can search those now. That will probably end up being the next round of scraping: hey, I want to grab some of the text off of these images.
Well, Apple's got this, and I think Google does too. I know we're getting a little long here, but you can also translate text. I've actually been watching some shows and it's like, oh, what does that say? I take a picture of it, highlight the text, say translate, and it'll translate it. It works for most languages. It's amazing where we're going with technology.
Well, I got introduced to that, gosh, probably 10 or 15 years ago now, because Google Translate was something you could slap onto any web page. There was something I was doing for somebody, I don't remember who the customer was, but they needed the site in something like six different languages. I don't know six different languages; programming languages, great, but spoken languages, no, I didn't know if I could do this. We were talking through some options, and they were going to hire some people, extract everything out into strings, and then convert and translate all of it. I ended up playing around with Google Translate instead: you just put a little button there, tell it what language, and boom, it gives you that language. It was two lines of code. It wasn't perfect, because it's just a brute-force translation, but it was close enough for the languages they were using. They were like, yeah, that works, that's what we need. And we said, okay, if there's a language you really need and it doesn't translate well, we can always go back and create that specific page for that language; there are things we can do. Stuff like that gets you the 80/20 rule: if it can get you 80% of the way there or better, then why not? Some of that stuff is almost free to use, or actually is free, so why not take it?
Exactly. All right, I think we can call it a wrap. We got a couple of episodes done; we'll come back next week, we'll do it on Thursday, and we'll just keep chugging along. Sounds good. Thanks again, Rob, for moving around and dealing with the weather. We're supposed to get the really bad storms tomorrow, so I didn't want to take a chance on losing internet during that. Good point, and yeah, it's sort of frustrating when that happens. So to the rest of you, goodbye, have a good one, and we'll talk to you again next time. We'll just keep chugging along, and I'm sure we'll have plenty to rant about next week as well.
[Music]
Transcript Segments
1.35

[Music]

27.96

similar thing um we've talked about this

32.239

I think we did this one time at a mentor

34.64

class way back I think there was a

35.92

presentation that was and it's basically

37.84

because I had a conversation today with

39.239

somebody that brought it up again I was

41.2

thinking this may be a good good topic

43.2

to cover is talking about in

45.96

particularly scraping versus API now

49.12

this isn't as much of a probably not as

51.8

much a technical thing although I guess

53.359

it is some extent because I I I think

56.039

too often that particularly when you're

58.559

getting started you'll you know when

60.6

you're not when you haven't done a lot

62.16

as a

63.04

developer you'll hear you know somebody

65.199

will say hey we need to scrape this data

67.64

and then you just like okay well it's on

69.64

a website we got to go to that website

71.32

we've got to find a way to crawl it and

72.84

pull that because because that's a cool

75.08

fun challenging thing to do or because I

77.84

just did that in my you know my software

80

Class A couple years ago so I sort of

81.6

know how to do it and it is one of those

83.439

things like once you know how to do it

84.52

it's like cool I can do it it's slow

87.24

it's tedious it's painful depending on

89.159

how well they use IDs and whether you

90.84

have to use CSS selectors or xath and

93.2

all that kind of crud but it's also so

97.6

fragile and even with um there's some AI

101.32

projects that are working on that to try

102.799

to make that less fragile but even with

106.24

that apis are so much better and there's

109.32

just like even the idea of like

110.6

importing and exporting data like just

112.759

file uploads and you know that kind of

114.52

stuff that's they are ways to integrate

117.24

and to move data from system to system

119.399

that I think people don't think about as

122.2

much and so that's what I want to talk a

124.119

bit about it's it's really going to be

125.52

more like we talk about apis and some of

127.64

the the the pros and cons of those

130.84

versus like just go through a scraping

132.599

thing and stuff like that um I see is

135.64

that sound of like seem like something

137.12

You' got some you have some thoughts on

138.84

as well yeah

141.16

because incidentally that's also the

144.4

problem with the web automation with the

146.519

web testing because essentially having

150.2

to scrape to find the ID is to in uh

154.239

basically to interact with the using the

156.44

web driver with the pages and one of the

160.72

arguments I have not necessarily about

163.08

screen scraping but about website design

164.84

in general is good website development

168.48

requires IDs on your any input fields or

172.68

any interactive action fields that you

174.76

have on a page if you're not doing that

176.72

you're essentially writing bad code and

179.44

I don't care how much AI you can put out

181.48

there you're never going to fix the

183.12

problem you have to write the code

184.68

correctly or it's just not going to work

188.08

yeah so I think that's and that's

190.799

exactly the place one of the things we

192.4

want to talk about is because that's

195.48

that's part of why scraping is such a

197

pain in the butt because most code is

198.68

not written that way sometimes

200.2

intentionally but I think we can talk a

202.12

little bit we can dig a little bit into

203.519

the

204.159

whole we get a little nerdy on that

206.44

about getting into a control and how do

208.72

you how do you navigate to that and pull

210.48

that information back out

212.879

so hello and welcome back we are here

215.84

for another episode of building better

218.239

developers also known as the developer

220.48

or podcast actually I think it was it

222.92

was developer or first and then it

224.439

became building better developers but it

226.08

doesn't that's like neither here nor

227.879

there I'm Rob one of the founders of

230.4

develop and or and across the uh the

234.439

digiverse or whatever it is is Michael

237.4

welcome and I want you introduce

238.959

yourself as well

240.599

hey everyone my name is Michael molash

242.079

co-founder of develop Nur and founder of

244.159

Envision

245.36

QA today this episode we are going to

248.959

talk about we really it's at a high

251.68

level we're going to talk about

252.4

integration but we're going to take a

254.239

step down and talk about like really the

256.759

differences between scraping versus

259.84

using apis or other methods there are

263.759

many ways that you may need to or want

266.08

to ingest data into a system and I want

269.84

to talk a little bit about those because

271.88

there is I too often they sort of just

274.32

there's like a broad brush approach and

276.88

either everybody thinks that you're just

279.44

at the most simple of like all we can do

281.4

is import you know we have to have a CSV

283.759

and import that and that's the only way

284.88

we get data in or the other end of the

288.84

only way we can do this is have this

290.52

really complex web scraper that goes out

292.4

and crawls all these sites and grabs all

294.08

this information and then of course as

295.72

soon as any that information changes on

297.32

the site it's broken and we have to

299.039

rewrite the code

300.639

neither of those are 100% correct

303.52

although they might be for you it

304.919

depends on where you're

306.4

at so I think what I want to start with

309.68

is the let's start on the The Far Side

312.32

of that is the challenges of scraping

315.52

and in particular uh this is something

318

that's although it is near and dear to

320.36

my heart because I've had to do this

321.52

with several projects it's maybe even

323.12

nearer in Deer or however it is to

326.12

Michaels because this something that he

327.919

does as part of his

330.479

code generation tool and so I think

333.12

we'll start with that is what are the

335.479

let's start with what let's talk a

338.199

little bit what makes a good page for

341.8

scraping for getting information back

344.44

off of that page from a program

346.6

programmatic scent not we're not talking

348.199

user experience stuff but we're talking

350.28

the backend

353.36

side so I guess one additional thing

357

maybe you can touch on too is you know

360.319

why what is web scraping why why do we

363

use it before I get into that because I

364.919

want to make sure listeners understand

366.88

the difference between web scraping and

368.8

interacting with

370.4

apis all right I will take that volley

372.88

back and I'll work with that one so

375.24

scraping I have heard people that have

377.36

actually I've come across people that

379.16

think all of these things are the same

381.08

so for our purposes we're going to talk

382.68

about scraping is and I think it's

384.8

probably the technically correct way

386.199

that you look at it scraping is when you

388.12

go to either a a it's basically go to a

391.88

user interface and you from a program

395.039

are trying to do the same thing a hum

397.88

would do it actually comes all the way

400.36

it goes back to the old um main frames

404.88

and they would screen scrape what they

406.24

would do is they would have something

407.599

that would pull the display basically

411.96

back and it would go you know count out

414.16

15 columns over here and three rows down

416.12

and grab that value and then go six

418.599

columns over here and two rows up from

420.319

that and then grab that value so it's

422.44

literally like looking at your screen in

424.759

that case the grid that it was because

427.16

it was a i it's usually like a 40 by 25

430.4

grid and it was like where do you go on

432.36

it to get this specific value and then

435.44

sometimes it would be you know you grab

437.28

three blocks in a row to get a value

439.639

that's a string that you know is you

441.16

know three characters in length that has

444.639

advanced into the web world of go to any

448.28

page the easiest way to like to to get a

450.96

feel for it is to go to any web page

453.08

with like pretty much any browser these

454.8

days rightclick and then hit inspect and

458.199

you're going to end up getting somewhere

459.599

there you're going to get an option to

460.879

like pull out a JavaScript or a view

462.639

Source kind of thing and if you this is

464.68

just if you don't know you know if you

466.68

don't know HTML on that what you're

468.36

going to see is you're going to see what

469.68

is a essentially a formatted document

472.12

there's all of these little tags and

475.159

there's these ways that you build out

477.479

that page behind the scenes so if think

479.72

of that page is all these little

481.24

controls or these little widgets however

483.08

you best organize them the goal with

486.08

scraping is like for example if I'm

488

going to a an input form that has first

490.159

name last name email phone number and I

492.599

want to go grab the phone number then I

495.479

can go look at that document format and

497.52

I know that if I go to the you know the

500.24

email input field I can grab that that

503.56

value off of it or vice versa I can put

506.12

a value into that

508.08

field and how you get to

510.68

that is in itself a little bit of a

512.88

journey because there are multiple ways

514.519

to do that there are multiple ways to

516.839

tackle the the format and how we

519.8

navigate our way into those pages but

522

that's the scraping side API side is and

526.8

I I I figured I had like an example

529.04

today that I think probably works the

530.399

best is that if you want to go look up

533.24

information about let's say Michael you

535.519

can go to his website you can go find

537.6

his about page and then you can go look

539.64

for stuff that says his name his email

541.6

phone number all that kind of stuff and

543.72

you can go find that and then you

545.399

personally would have to go look at it

546.72

find it on the screen and then write it

548.519

down somewhere or if you know him enough

552.04

you could just text him something and

553.8

say hey I need your your phone number

555.72

your email address and he will give you

557.6

back in a way that you can easily read

559.68

quickly you know his email and his phone

562.16

same thing people do that all the times

564.16

with email you send somebody says hey I

566.839

need your address they'll give you a

568.24

nice formatted little ADD address thing

570.279

that's sort of what an a an API is is

572.24

that instead of you having to go look

574.2

through all this stuff you say hey this

576

is the stuff I want this is a data I

577.959

want and the API gives that back to you

581

in a really nice format usually it's

583.399

either depending on how you do it it's

584.92

usually going to be Json or XML or some

587.12

different things like that depending on

588.24

what you're doing but the the bonus is

591.519

it's well formatted and it is it's not

595

guesswork if you're scraping you're

597.399

trying to figure out is this B data even

599.48

available in an API it is a there is a

602.92

contract there of I'm going to give you

605.04

this information and you're going to

606.44

give me back these fields and the data

609.88

that match those now I think I got a

613

little long winded so sorry about that

614.68

so I'll throw you back to you Michel and

616.72

let's see if see if you can do and it

618.72

say anything in a shorter method than I

620.44

just did no that was really good and

623.64

thank you for doing that because I I

625.04

just wanted to make sure that we were

626.839

kind of on the Baseline one other thing

628.92

to kind of mentioned here because if

631.56

you're dealing with apis like Rob said

633.6

you're kind of dealing with the contract

635.04

you're dealing with a controlled way of

637.639

passing data from your system to a a

641.839

requestor a user someone trying to get

643.72

information from you screen scraping is

647.959

essentially going out to any site and

650.72

just basically trying to take the data

653.12

that is being displayed on a web

655

page and in many cases

659.36

not many companies want you to go do

661.399

that uh because they want their data to

663.399

be protected where which is where apis

665.88

are a little more powerful they're a

667.12

little more controlled and you can

668.959

actually put um restrictions and

671.72

security on top of them to only allow

673.92

certain people access to your data

675.88

whereas screen scraping if it's on the

677.76

web you can get it uh which is where a

680.24

lot of your AI tools today are out there

682.6

just kind of reading a whole bunch of

685.04

data from so many different pages

686.8

they're essentially scraping those pages

690.079

from a developer's perspective if I'm

692.92

building a site for screen scraping the

696.24

site needs to follow basic web page or

700.76

Web building techniques every element on

704.079

your page that contains data that is

706.959

necessary for consumption should have an

709.8

ID maybe a name depending upon which uh

714.56

you know if you're using uh Json or

716.56

whatever your tool is but essentially ID

719.399

is the universal ID it has to be unique

721.68

for all your elements on your page you

724.399

should never have two duplicate

727.079

IDs so if your developers are following

730.44

best practices and building pages with

734.88

IDs then you should have no problem you

736.92

should be able to just say hey go read

739.279

the Dom read pull down basically or

742.6

download the web page run a script to

744.8

find all the IDS on the page boom there

746.68

you go you can go grab all the data off

748.32

the page real simple similar to an API

751.32

but an API you actually make a request

753

you get back Json XML which is very easy

755.24

to parse um but like I said if you do

758.92

the API and it returns XML if you're

760.959

doing a screen scrape and your page has

763.839

good IDs you essentially could do the

765.92

same thing it's essentially you're

767.24

reading an XML you're reading a Dom

769.32

document and the two could almost be

771.639

identical if it's done correctly one of

774.6

the biggest problems we have especially

776.76

from a testing perspective is not a lot

779.36

of developers are building web pages

782.04

with IDs in mind they're not thinking

783.8

about testers they're not thinking about

785.32

web scrapers they're just trying to get

787.04

the code after as quickly as possible

789.6

Sometimes using tools would drag and

791.32

drop which those are great but those

794.079

typically will generate a random ID for

797.76

the page now every time the page gets

800.04

rebuilt all those IDs change they all

803.56

randomly change so if you're doing that

806.92

well yeah your page has IDs but now oh

809.72

next time I try to run a scraper it all

811.16

fails because none of the IDS are the

812.76

same so these are some of the problems

814.399

and restrictions we have to look at when

816.519

building Pages we have to think how is

819

this data going to be consumed is it

820.92

just someone interacting with the web

822.839

page is it do I need to make sure this

825.199

is tested do I have a more complex

827.199

interface where people have to go in and

828.839

actually use this like um like a patient

831.36

portal or you know an inventory

833.959

management system those things should

836.04

have IDs because one if you're consuming

839.16

data if you're sending data back to a

840.92

system to be saved you need to have some

842.959

identifier some unique identifier and

845.759

that's how we send information back and

847.6

forth so we know that oh hey I got this

850.399

field this field is this data stored in

852.959

this location if you don't do that how

854.759

do you know what data you're

856.68

saving and that's I think that gets us

859.6

into the one of the big differences

861.759

between the two when you think about it

863.12

is if you from an API you can say hey I

866.199

want let's say I want a customer data I

868.32

can go in I can ask I can say here's a

870.56

customer ID give me the information back

873.399

and it'll say oh for customer ID one

875.92

here's their name here's their address

878.199

blah blah blah if you're doing the

880.92

scrape particularly if you if you've got

884.519

these random IDs and some of these

886.16

things or no IDs then what you're doing

888.36

is you're going to say okay I need to

890.32

First find a way to get to the search

893.24

page that and then figure out how do I

895.44

search for that customer and then when I

897.24

get to that customer then I have to go

899.44

find somewhere in the document where is

901.759

the name field where is the address

903.44

field and if I have IDs then it's really

907.399

just a matter of you just go find the

909.72

control with that ID and you go oh the

911.92

name the name field is ID 46 whatever it

915.68

is and here's a value and you can pull

917.759

it

918.8

out if you don't have that you end up

921.759

having to walk the Dom basically so you

923.88

say well if I go to the customer search

928.36

result page page and I can go to the

930

customer detail then I know that if I

932.839

get three divs down then I'm into the

935.24

customer information and then within

936.639

that one there's two divs over and now

939

I'm into the address information and

940.759

within that I can go into there's a you

943.92

know maybe there's an href in there so I

945.44

can grab a link and then I can do that

947.759

and so you're having to like you're

949.319

having to walk the Dom just like you

952.279

know think about indexing anything you

954.56

either can walk your way all the way

955.959

through like if you want to search for a

957.24

record in database you can look at each

958.6

record go is it is it is it is it or if

961.56

you have an ID you can jump right to

963.279

that row and that's what IDs will give

966.759

you and so if you want something that is

969.759

crawlable that is scrapable I guess

972.399

they're always going to be crawlable

974

which let me talk about that for a

975.68

second crawling is really just baby step

979.48

scraping instead of getting into the

981.16

details and trying to pull data out all

983.72

you're doing is going into that document

985.44

and try to find anything in there that

988

would link you to to another page and

989.92

then go follow that link now you may say

993.079

well that's a piece of cake it's always

994.319

an hre no it is not it could be an hre

996.759

it could be JavaScript it could be a

998.759

whole bunch of other things it could be

1001.04

and it could be behind the scenes it

1002.72

could be something where it's actually a

1004.16

call back to the server and the server

1005.72

sends something back so there's a lot of

1007.759

different ways that that that can be

1009.639

handled which is why sometimes that's

1012.48

done on purpose so that you instead

1014.639

you're driven to the API side and then

1017.36

they say hey we're going to charge you x

1019.16

amount of money so you can subscribe to

1021.16

the API and then you can get going if

1023.88

you haven't looked at these I would or

1026.64

if this is new to you I would go look at

1028.76

scrapey S C A py and I think it's

1042.679

scrape.on I think there's a site with it

1044.679

as well uh that they have and it may be

1047.039

I forget what the other one they called

1048.28

but essentially it's just like so you

1049.6

can go do some scraping it'll help you

1052

sort of drag and drop but there's also

1053.52

some python behind it so that you can

1055.08

actually go see what it looks like the

1058.36

other thing I would do is pick just

1060.4

about anything go search for API for

1062.52

whatever your favorite tool is you know

1064

whether you're uh I don't know if like

1065.84

you use HubSpot or if you use Google go

1068.32

look at the Google apis or Amazon's apis

1071.24

there's pretty much everybody out there

1073.44

there are apis some are free and some

1075.679

are going to be the much better way to

1077.44

deal with it and then some are not

1079.64

because they may still be a better way

1081.559

to deal with it because you don't have

1082.64

to deal with all the headaches of that

1084.32

we've just talked about but then you

1086.799

also have to pay them for it because

1088.039

they're going to say hey we're going to

1089.08

make it easier for you we're going to

1090.28

make make it nice for you but hey you

1093.679

know share the love and so you're going

1095

to pay us whatever you need to pay

1098

thoughts on

1099.4

those so those are really good examples

1102.08

now there's one Edge case neither of us

1104.48

have touched on yet and that is

1107.2

embedded code or like iframes or

1111.84

essentially a page within a page if

1114.24

you're trying to do scraping that won't

1116.32

work if you're trying to stream the page

1118.84

and read the data through a stream the

1121.159

only way I've gotten around this is to

1123.52

actually physically uh download the

1126.679

complete contents of the page through

1130.159

like Firefox or Chrome with a plugin uh

1133.84

but only certain browsers even give you

1135.88

the functionality to get all that embed

1139

code with your source code of the page

1141.36

when you try to export it so that's

1144.32

another limitation with scraping is if

1147.679

you're trying to scrape pages that are

1149.039

actually loading data from other sites

1151.08

through an like an embedded uh plugin

1154.08

you might not be able to get it you

1155.679

might come into it and it's like oh it's

1157.4

not there you can't find it uh and

1159.72

that's just something else to think

1161.08

about as you're going through this uh

1163.72

One other thing to think about, whether you're coming at it from the testing side or the scraping side: look at Selenium IDE. It's a really cool free tool. It has plugins for just about every browser and a good little desktop application. What it lets you do is open up your browser, open up Selenium, and literally click through the page; as you click, it records what you're doing. You can see whether the elements you're working with on your page have IDs, and how Selenium sees the page is essentially going to tell you how hard this project will be, in other words, how scrapeable the page is.

A bonus is that you can export that recording in various languages. So if you just want a brute-force scrape of data, you can have Selenium walk through all the steps, generate that code out, and it will actually do it for you; you just run it in whatever language you use. It supports several, say PHP or Python, and you can watch it open the browser, walk through everything, do what it needs to do, and then close the browser (hopefully; otherwise, call close yourself afterwards).

Selenium really is awesome. If you're curious whether this is something you can do, Selenium is probably the fastest way to open a page, get data off of it, and do that in a repeatable fashion. There are some gotchas, but it's an excellent starting point, and you'll see it all over the place: anybody dealing with scraping will almost always mention Python, and almost always mention Selenium, more so than anything else, because it really is the industry standard for robotic behavior on a web page.
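A hand-written version of the kind of code Selenium IDE exports is only a few lines. This is a sketch rather than an exact export: the URL and element ID arguments are hypothetical, and it assumes the `selenium` package and a Chrome driver are installed (the imports sit inside the function so the file loads without them):

```python
def scrape_element_text(url, element_id):
    """Open a browser, read the text of one element by its ID, and close
    the browser: the same walk-through Selenium IDE records and exports."""
    # Imports live inside the function so this sketch can be loaded even
    # where the selenium package is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # load the driver for your browser
    try:
        driver.get(url)                                   # open the page
        element = driver.find_element(By.ID, element_id)  # find by ID
        return element.text                               # the data you came for
    finally:
        driver.quit()  # always close the browser
```

If the recorder shows that your target elements have stable IDs, a loop of `find_element` calls like this is usually all a brute-force scrape needs; if they don't, that is your early warning that the project will be harder.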

1291.76

Final thoughts? All I would like to say, for those of you getting into scraping who haven't really done it before: like Rob said, check out both Selenium and Scrapy and play with them. Heck, go to the Develpreneur site, point Selenium IDE or Scrapy at it, and try to scrape our page.

Oh, one thing we didn't touch on: some pages also offer RSS feeds, which are essentially another way to scrape a page. It's yet another source, another feeder, like an API but public: you hit a page and there's all their data for you to digest, in a clean format.

That's funny, because as soon as you mentioned Develpreneur I thought, oh, we forgot to mention RSS feeds. For anybody who hasn't used them (you've probably heard of RSS readers), you can go look at the source of an RSS feed, and they tend to be well-formatted documents. It's very easy to crawl through them and get the data you need. I've worked with many of them over the years, and I've used them instead of a scraper, or even instead of an API. For example, a couple of episodes back, on the Spring Boot side I believe, I talked about building the Upwork integrator, a little app that goes out and grabs listings off of Upwork. I do that from an RSS feed, because Upwork lets me take my specific search and crank it out as an RSS feed, and then I just hit that feed and pull things down. It's limited; it's not going to give me everything. It gives me maybe 30, 40, 50 records at a time, but I can go through all of it. It's XML, so it's very easy to parse and very easy to pull out the pieces of information I want, and then I don't have to deal with an API and I don't have to deal with scraping. It's pretty darn solid: RSS feeds tend to be stable, and they're just out there.

So if you wanted, for example, to scrape the articles we've put out on the Develpreneur site, you can go to basically any WordPress site's /rss or /rss2 path, depending on how it's set up, and you'll see a nice XML document that you can parse and do whatever you want with. Other than repurposing it, obviously; unless you tell us, and then we're cool with that.
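Pulling the items out of a feed like that takes only the standard library. A minimal sketch; the feed content below is a made-up stand-in for the RSS 2.0 XML a WordPress /rss URL returns, which in practice you would fetch with urllib.request first:

```python
import xml.etree.ElementTree as ET

# Stand-in for the XML returned by a WordPress /rss (or /feed) URL.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Develpreneur</title>
    <item>
      <title>Navigating Data Integration</title>
      <link>https://example.com/post-1</link>
    </item>
    <item>
      <title>Another Episode</title>
      <link>https://example.com/post-2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)
# Each <item> is one record; pull out just the pieces you want.
posts = [(item.findtext("title"), item.findtext("link"))
         for item in root.iter("item")]
print(posts)
```

Because the feed is a contract-like, well-formed document, this parse rarely breaks the way a page scrape does when markup shifts.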

1458.72

That being said, I think we're going to wrap this one up. As always, shoot us an email at info@develpreneur.com if you have any questions, suggestions, or any other feedback; we'd love to hear it. You can check us out on YouTube at Develpreneur. Go out there and have yourself a great day and a great week, and we will talk to you next time.

All right, thoughts? That seemed to go pretty well too, I thought. Yeah, I liked it. In fact, it gave me an idea, and I'm going to throw it out here so I remember it when I'm listening back: I'm going to put the links out there for Scrapy (there's a plugin for that) and one for Selenium IDE, and then I may even throw out a simple Python script that just screen scrapes a page.

1510.919

Yeah, if you need that, I've got about a thousand of those. We do that so often it's almost funny; it's become just standard. That's what a new customer I was talking to today asked: hey, can we get some data, can we even do that? Yes, people do that all the time. We've done it for all kinds of different verticals; we'll just go grab it.

The one thing about scraping is that it's not 100%. You'll get some weird stuff. We didn't even get into this, but we had a project where you'd pull down a PDF, and you would think: pull down a PDF, like XML or CSV or anything else, just pull it down, scrape it, and you're good. Even with PDFs, it was amazing. It was obviously the same generation engine every time, a nice little PDF report we were trying to pull data from, and periodically it would just be totally off. We finally got it to better than a 98% hit rate, but we'd still hit cases like: nope, they added another column there. There shouldn't be one, but there it was, because, for example, if somebody put extra spaces somewhere, the parser would decide that was now another field, another column. And you really couldn't see it by looking: the PDF looked fine. You had to look at the document behind the scenes, at the internal structure of the PDF, to realize they had thrown in an extra column that's empty and has no value. That's the kind of thing you can run into.

Here's yet another way you can get into scraping and integrating data; we've done it all, it feels like, over the last few years. Over a decade ago, there was a site out there for a while that had all the old tabletop documents: old-school Dungeons & Dragons, BattleTech, MechWarrior, all the old stuff that's discontinued and you can't get anymore. He had it all online, basically like a Dropbox before Dropbox was really popular. I just wrote a little scraper that went out, literally found all the links, and downloaded all the documents into the same file structure, and then I could parse through it and find the ones I was looking for. I couldn't find them through the site itself, so I figured: I'll just get them all and sort it out later. Those are things you can do with these scrapers; it's not just scraping data. You can actually download files, interact with things, pull in videos, pull in images; lots of things you can do with it.

I actually did that too, and this is bonus material, so there you go. I had a customer I built a site for, with maybe 20 or 30 different pages and screens, and at the end of the day she wanted a screenshot of every single screen as part of the user documentation. I thought, oh my gosh, that's going to take a while, but what I ended up doing was a screen scraper that just walked through everything. It was easy enough because I knew all the links: I crawled the site initially, and for each page it was go to the page, snap a picture, go to the next page, snap a picture. I think there were about three pages I had to add manually, and suddenly I had the 40 or 50 images she needed. Simple things like that: it's not until you think about it that you go, hey, that's actually a better way than doing it by hand. Once you've done it a couple of times, you look at a task and think, I really don't want to do that a hundred times, but maybe I can spend a few minutes, especially with some of these tools, generate something I can replicate over and over, and have it do the work for me instead of going through that drudgery.

Now it's funny you mention that, because on the testing side of things you can do the same with the Selenium WebDriver: walk through a sitemap and take screenshots through WebDriver. If you want a screenshot in Chrome or Firefox, you essentially load the driver for that browser and then take a snapshot of the page you're on. So if you're doing browser comparisons, to see how your site looks in each browser, you take the screenshots and then do an image compare, and it will tell you: hey, this image does not match those. It's very simple, it's kind of cool, and there are good examples out there for it. That's just another way you can do it with some of the tools online. That's a neat one.
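The crawl-and-snap loop described above is only a few lines with the WebDriver API. A sketch, assuming the `selenium` package and a Firefox driver are installed; the URL list and output directory are hypothetical, and the imports sit inside the function so the file loads without Selenium:

```python
def snapshot_site(urls, out_dir="shots"):
    """Visit each URL and save a screenshot of the page: the crawl-and-snap
    loop used for user-documentation images and browser comparisons."""
    # Imports live inside the function so the file loads without Selenium.
    import os
    from selenium import webdriver

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()  # or webdriver.Chrome(), per browser
    try:
        for i, url in enumerate(urls):
            driver.get(url)                     # go to the page
            driver.save_screenshot(             # snap a picture
                os.path.join(out_dir, f"page_{i}.png"))
    finally:
        driver.quit()
```

Run it once per browser driver and you have matched sets of images ready for the image-compare step.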

1828.24

We did one like that. What this customer was doing was, I'll say, a little bit shady; it's legal, but it's borderline stuff, and this was one of the things they did. Their scraper was an automation tool, and the target site was very sensitive to certain themes and things like that. So they had a crawler go through and take a picture of each of the themes, so that later the tool could ask: am I on this page? It could compare the whole image, as opposed to trying to look for other markers. They had tried looking for certain tags, but those would change; things would move, the formats would change. The rendered page, though, basically looked the same, so it was enough to say: if this is an 80% match, we're basically on the right page; if it's not, we're on the wrong page. They even did it with buttons: instead of trying to find the button control, they would check whether that button's image existed somewhere in the screenshot, figure out the coordinates, and press the button that way. At first I asked, why do you have 18 pictures of the same button, all just a little bit different? It turned out to have to do with the way they were spinning machines up: the display resolution, and which default color themes would show up. It's funny, the kind of stuff you don't think about, like why you would need to take pictures as you're walking through a site, but there are a lot of uses for such things.
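Real tools use image libraries for the comparison, but the "80% match" test above reduces to counting agreeing pixels. A toy sketch with made-up pixel data, not the customer's actual tool:

```python
def match_percent(pixels_a, pixels_b):
    """Return the fraction of pixels two same-sized images share:
    the 'is this an 80% match' check, in miniature."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must be the same size")
    same = sum(1 for a, b in zip(pixels_a, pixels_b) if a == b)
    return same / len(pixels_a)

# Two tiny fake "screenshots" as flat lists of grayscale pixel values;
# the candidate differs from the reference in one pixel out of eight.
reference = [0, 0, 255, 255, 0, 0, 255, 255]
candidate = [0, 0, 255, 255, 0, 9, 255, 255]
score = match_percent(reference, candidate)
print("right page" if score >= 0.8 else "wrong page")  # prints "right page"
```

A threshold comparison like this is what makes the approach robust to small theme and resolution differences where exact tag matching kept breaking.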

1929.48

It's even interesting now, on that last point about taking pictures to see what was there: Amazon now has one of its libraries (it's not Polly; Rekognition, I believe) where you can basically look for text on an image and then scrape the text from the image as well. So not only can you scrape web pages, you can also scrape the images on a page.

That's true. That's one I've seen some of in recent years. I don't know how well it works in general, but I've seen a couple of places that use it, and it seems to work pretty well for them. (Polly, I think, is the text-to-speech one, something like that.) I was amazed at some of the image-processing stuff; this was, gosh, three or four years ago, when I was going through all the Amazon services and they had just opened a couple of those up. It was built initially for augmented- and virtual-reality work, but you could have a series of pictures and say, show me the parrots, and it would go find the parrots in each of those. Not from a picture of a parrot; you'd just ask for parrots, and it would pull them out. It's amazing, the image processing they can do and how well you can actually search images now. That will probably end up being the next round of scraping: hey, I want to grab some of the text off of these images.

Well, Apple's got that, and I think Google does too (I know we're getting a little long here): you can also translate text. I've actually been watching some shows and wondered, oh, what does that say? I take a picture of it, highlight the text, and say translate, and it'll translate it. It works for most languages. It's amazing where we're going with technology.

I got introduced to that, gosh, probably 10 or 15 years ago now, because Google Translate was something you could slap onto any web page. There was something I was doing for a customer (I don't remember who) and they needed the site in about six different languages. I don't know six different languages; programming languages, great, but spoken languages, no. I didn't know if I could do it. We were talking through it, and they were going to go hire some people and rewrite things: extract everything out into strings and then convert and translate all of it. Instead I ended up playing around with the Google Translate widget. You just put a little button there, tell it what language, and boom, it gives you that language. It was two lines of code. It wasn't perfect, because it's just a brute-force translation, but it was close enough for the languages they were using, and they said: yeah, that works, that's what we need. And then we agreed: if there's a language you really need and it doesn't translate well, we can always go back and create that specific page for that language. Stuff like that gets you the 80/20 rule: if it can get you 80% of the way there or better, and it's almost free to use (or actually free), why not just take it?

Exactly. All right, I think we can call it a wrap; we got a couple of episodes done. We'll come back next week, we'll do it on Thursday, and we'll just keep chugging along. Sounds good. Thanks again, Rob, for moving around and dealing with the weather; we're supposed to get the really bad storms tomorrow, so I didn't want to take a chance on losing internet during that. Good point; it is sort of frustrating when that happens. So, for the rest of you: goodbye, have a good one, and we'll talk to you again next time. I'm sure we'll have plenty to rant about next week as well.