Summary
In this episode, we discuss the differences between scraping and APIs for data integration. Rob and Michael explore the challenges of scraping, including the need for unique IDs on web pages, and the benefits of using APIs for controlled data exchange.
Detailed Notes
The hosts, Rob and Michael, discussed the challenges of web scraping, including the need for unique IDs on web pages and the difficulties of navigating complex web pages. They also highlighted the benefits of using APIs for controlled data exchange, including improved security and efficiency. Rob and Michael provided several examples to illustrate their points, including the use of Selenium for web scraping. They also discussed the importance of considering the needs of both the developer and the user when building web applications.
Highlights
- Scraping is when you go to a user interface and you, from a program, are trying to do the same thing a human would do.
- APIs are a controlled way of passing data from your system to a requester, a user, someone trying to get information from you.
- If you're building a site that will be screen scraped, the site needs to follow basic web-building techniques.
- With an API, you actually make a request and get back JSON or XML, which is very easy to parse.
- Selenium IDE is a really cool free tool; it's got plugins for just about every browser and a good desktop application.
Key Takeaways
- Scraping is a more complex and less secure method of data integration compared to APIs.
- APIs provide a controlled way of passing data from one system to another.
- Selenium is a useful tool for web scraping, but it has limitations.
- RSS feeds can be used as an alternative to scraping or APIs for data integration.
- APIs and RSS feeds offer more flexibility and control over data exchange compared to scraping.
Practical Lessons
- Always consider the needs of both the developer and the user when building web applications.
- Use unique IDs on web pages to make scraping easier.
- Prefer APIs over scraping where they are available; they are a more secure and efficient method of data exchange.
- Use Selenium IDE to record interactions with a page and gauge how hard it will be to scrape.
- Consider RSS feeds as a stable, well-formatted alternative to scraping or APIs for data integration.
Strong Lines
- "In an API, there is a contract: I'm going to give you this information, and you're going to give me back these fields and the data that match them."
- "Crawling is really just baby-step scraping."
- "Whereas with screen scraping, if it's on the web, you can get it."
Blog Post Angles
- The importance of considering the needs of both the developer and the user when building web applications.
- The benefits of using APIs for controlled data exchange compared to scraping.
- The limitations of web scraping and the importance of using APIs or RSS feeds for data integration.
- The role of Selenium in web scraping and its limitations.
- The importance of unique IDs on web pages for easier scraping.
Keywords
- scraping
- APIs
- data integration
- Selenium
- RSS feeds
- web development
- data exchange
Transcript Text
Welcome to Building Better Developers, the Develpreneur podcast, where we work on getting better step by step, professionally and personally. Let's get started. Hello and welcome back. We are here for another episode of Building Better Developers, also known as the Develpreneur podcast. Actually, I think it was Develpreneur first and then it became Building Better Developers, but that's neither here nor there. I'm Rob, one of the founders of Develpreneur, and across the digiverse, or whatever it is, is Michael. Welcome, and I want you to introduce yourself as well. Hey everyone, my name is Michael Meloche, co-founder of Develpreneur and founder of EnvisionQA. Today, at a high level, we are going to talk about integration, but we're going to take a step down and talk about the real differences between scraping and using APIs or other methods. There are many ways that you may need to, or want to, ingest data into a system, and I want to talk a little bit about those, because too often there's a broad-brush approach. Either everybody thinks you're at the most simple end, as in all we can do is import a CSV and that's the only way to get data in, or, at the other end, the only way to do this is to have a really complex web scraper that goes out, crawls all these sites, and grabs all this information. And then, of course, as soon as any of that information changes on the site, it's broken and we have to rewrite the code. Neither of those is 100% correct, although one of them might be for you; it depends on where you're at. So I think I want to start on the far side of that: the challenges of scraping.
And in particular, although this is near and dear to my heart because I've had to do it on several projects, it's maybe even nearer and dearer to Michael's, because it's something he does as part of his code generation tool. So I think we'll start with this: what makes a good page for scraping, for getting information back off of that page programmatically? We're not talking about user experience stuff; we're talking about the back-end side. I guess one additional thing you can touch on first is: what is web scraping, and why do we use it? Because I want to make sure that listeners understand the difference between web scraping and interacting with APIs. All right, I will take that volley back and work with that one. So, scraping. I've come across people who think all of these things are the same. For our purposes, and I think this is the technically correct way to look at it, scraping is when you go to a user interface and, from a program, try to do the same thing a human would do. It actually goes all the way back to the old mainframes, where they would screen scrape: they had something that would pull the display back and go, count out 15 columns over here and three rows down and grab that value, then go six columns over and two rows up from that and grab that value. So it's literally like looking at your screen. In that case it was a grid, I guess usually something like 40 by 25, and the question was where on that grid you go to get a specific value.
And sometimes you would grab three blocks in a row to get a value that's a string you know is three characters long. That has advanced into the web world. The easiest way to get a feel for it is to go to any web page in pretty much any browser, right-click, and hit Inspect; you'll get an option to view the source or pull up the developer tools. If you don't know HTML, what you're going to see is a formatted document: there are all these little tags, and there are ways you build out that page behind the scenes. So if you think of that page as all these little controls or widgets, however you best organize them, the goal with scraping is, for example: if I'm going to an input form that has first name, last name, email, and phone number, and I want to grab the phone number, I can look at that document format, and I know that if I go to the email input field, I can grab the value off of it, or vice versa, I can put a value into that field. How you get to that is itself a bit of a journey, because there are multiple ways to do it, multiple ways to tackle the format and navigate our way into those pages. But that's the scraping side. For the API side, I had an example today that I think works best: if you want to look up information about, let's say, Michael, you can go to his website, find his about page, and look for stuff that says his name, his email, his phone number, all that kind of thing. You can find all of it, but you personally have to look at it, find it on the screen, and then write it down somewhere.
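The ID-based lookup described here (find a known input field in the page's markup and read its value) can be sketched with Python's standard-library HTML parser. This is a minimal illustration, not anyone's production scraper; the HTML snippet and the `phone` id are invented:

```python
from html.parser import HTMLParser

class FieldScraper(HTMLParser):
    """Pull the value of one input field out of a page by its id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == self.target_id:
            self.value = attrs.get("value")

# Invented form markup standing in for a downloaded page
page = """
<form>
  <input id="first_name" value="Rob">
  <input id="phone" value="555-0123">
</form>
"""

scraper = FieldScraper("phone")
scraper.feed(page)
print(scraper.value)  # 555-0123
```

Because the lookup keys on a stable id rather than position, it keeps working even if the form fields are reordered.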
Or, if you know him well enough, you could just text him and say, hey, I need your phone number and email address, and he will give them back in a way you can easily and quickly read. Same thing with email: somebody says, hey, I need your address, and they get back a nicely formatted little address block. That's sort of what an API is: instead of you having to look through all this stuff, you say, hey, this is the data I want, and the API gives it back to you in a really nice format. Depending on how you do it, it's usually going to be JSON or XML or something similar. But the bonus is that it's well formatted, and it's not guesswork. If you're scraping, you're trying to figure out whether the data is even available. With an API, there is a contract: I'm going to give you this information, and you're going to give me back these fields and the data that match them. Now, I think I got a little long-winded, so sorry about that. I'll throw it back to you, Michael, and let's see if you can say anything more concisely than I just did. No, that was really good, and thank you for doing that, because I just wanted to make sure we were on the same baseline. One other thing to mention here: if you're dealing with APIs, like Rob said, you're dealing with a contract, a controlled way of passing data from your system to a requester, a user, someone trying to get information from you. Screen scraping is essentially going out to any site and just trying to take the data that is being displayed on a webpage.
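The API side of that contrast can be shown in a few lines: the response is already structured, so there is no hunting through a page. The payload shape here (a `customer` object with `name`, `phone`, and `email` fields) is made up for illustration, not any real API's format:

```python
import json

# Stand-in for the body of an API response; a real call would
# fetch this over HTTP, but the parsing step is the same.
response_body = (
    '{"customer": {"name": "Michael", '
    '"phone": "555-0123", "email": "m@example.com"}}'
)

# The contract tells us exactly which fields come back and where.
customer = json.loads(response_body)["customer"]
print(customer["phone"])  # 555-0123
```

Compare this one dictionary lookup with the positional DOM-walking a scraper has to do when a page lacks stable IDs.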
And in many cases, companies don't want you to do that, because they want their data to be protected, which is where APIs are a little more powerful. They're a little more controlled, and you can actually put restrictions and security on top of them to only allow certain people access to your data. Whereas with screen scraping, if it's on the web, you can get it, which is how a lot of today's AI tools are out there reading data from so many different pages: they're essentially scraping those pages. From a developer's perspective, if I'm building a site, or screen scraping one, the site needs to follow basic web-building techniques. Every element on your page that contains data necessary for consumption should have an ID, or maybe a name, depending on your tooling. But essentially, the ID is the universal identifier. It has to be unique among all the elements on your page; you should never have two duplicate IDs. So if your developers are following best practices and building pages with IDs, you should have no problem. You should be able to say: read the DOM, basically download the webpage, run a script to find all the IDs on the page, and boom, there you go, you can grab all the data off the page. Real simple, similar to an API. With an API, you actually make a request and get back JSON or XML, which is very easy to parse. But like I said, if the API returns XML, and you're doing a screen scrape on a page with good IDs, you essentially can do the same thing: you're reading an XML document, you're reading a DOM document, and the two could be almost identical if it's done correctly. One of the biggest problems we have, especially from a testing perspective, is that not a lot of developers are building web pages with IDs in mind. They're not thinking about testers. They're not thinking about web scrapers.
They're just trying to get the code out there as quickly as possible, sometimes using drag-and-drop tools. Those are great, but they typically generate random IDs for the page, and every time the page gets rebuilt, all those IDs change. So yes, your page has IDs, but the next time I try to run the scraper, it all fails, because none of the IDs are the same. These are some of the problems and restrictions we have to consider when building pages. We have to think about how this data is going to be consumed. Is it just someone interacting with the web page? Do I need to make sure this is tested? Do I have a more complex interface where people actually have to go in and use it, like a patient portal or an inventory management system? Those things should have IDs, because if you're consuming data, or sending data back to a system to be saved, you need some unique identifier. That's how we send information back and forth, so we know: I got this field, and this field is this data stored in this location. If you don't do that, how do you know what data you're saving? I think that gets us into one of the big differences between the two. With an API, I can say, let's say I want customer data: I can go in, pass a customer ID, and ask for the information back, and it'll say, for customer ID one, here's their name, here's their address, and so on. If you're doing the scrape, particularly with random IDs or no IDs at all, what you're doing instead is saying, okay, I first need to find a way to get to the search page, then figure out how to search for that customer, and then, when I get to that customer, I have to find somewhere in the document: where is the name field? Where is the address field?
If I have IDs, it's really just a matter of finding the control with that ID: oh, the name field is ID 46, or whatever it is, and here's the value, and you can pull it out. If you don't have that, you end up having to walk the DOM. You say, well, if I go from the customer search result page to the customer detail page, I know that if I get three divs down, I'm into the customer information; within that one, two divs over, and now I'm into the address information; and within that, maybe there's an href so I can grab a link, and then I can do that. So you're having to walk the DOM, just like indexing anything. You can either walk your way all the way through, like searching for a record in a database by looking at each record and asking, is it this one? Is it this one? Or, if you have an ID, you can jump right to that row. That's what IDs give you. So that's if you want something that is scrapable, and I guess pages are always going to be crawlable, which, let me talk about that for a second. Crawling is really just baby-step scraping. Instead of getting into the details and trying to pull data out, all you're doing is going into that document, trying to find anything in there that would link you to another page, and then following that link. Now you may say, well, that's a piece of cake, it's always an href. No, it is not. It could be an href. It could be JavaScript. It could be a whole bunch of other things. It could be behind the scenes, an actual call back to the server where the server sends something back. There are a lot of different ways that can be handled, and sometimes that's done on purpose, so that you're driven to the API side instead.
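The DOM-walking fallback described above (count your way down through nested divs to reach a value) can be sketched with the standard library's XML tree parser, assuming a well-formed page. The layout here is invented, which is exactly the fragility being described: the moment the layout changes, the positional counts break:

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page with no IDs to jump to.
page = """
<html><body>
  <div>header</div>
  <div>nav</div>
  <div>
    <div>Customer: Jane</div>
    <div>
      <div>Address: 123 Main St</div>
    </div>
  </div>
</body></html>
"""

body = ET.fromstring(page).find("body")
# "Three divs down" into the customer block...
customer_block = list(body)[2]
# ...then into the second nested div for address info.
address_block = list(customer_block)[1]
address = list(address_block)[0].text
print(address)  # Address: 123 Main St
```

Every hard-coded index here is a breakage point, which is why stable IDs (or an API contract) are so much more robust.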
And then they say, hey, we're going to charge you X amount of money so you can subscribe to the API and get going. If you haven't looked at these, or if this is new to you, I would go look at Scrapy, S-C-R-A-P-Y, at scrapy.org. The site will help walk you through things: you can go do some scraping, it helps you with some drag and drop, but there's also Python behind it so you can actually go see what the code looks like. The other thing I would do is pick just about anything and search for its API, whatever your favorite tool is. Whether you use HubSpot or Google, go look at the Google APIs, or Amazon's APIs; pretty much everybody is out there. There are APIs, and some are free, and some of those are going to be the much better way to deal with it. And some are not free, though they may still be a better way to deal with it, because you don't have to deal with all the headaches we've just talked about; you just also have to pay for them, because they're going to say, hey, we're going to make it easier for you, we're going to make it nice for you, but share the love, so you're going to pay us whatever you need to pay. Thoughts on those? Those are really good examples. Now, there's one edge case neither of us has touched on yet, and that is embedded code, like iframes, essentially a page within a page. If you're trying to do scraping, that won't work if you're trying to stream the page and read the data through a stream. The only way I've gotten around this is to physically download the complete contents of the page through Firefox or Chrome with a plugin.
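The baby-step crawling described a moment ago (scan a document for anything with an href and collect it to follow later) can be sketched in a few lines; this is a toy illustration of the idea, not a real crawler, and as the hosts note it won't see JavaScript-driven links at all, which is part of what frameworks like Scrapy help with:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect every href in a document, the first step of a crawl."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if href:
            self.links.append(href)

collector = LinkCollector()
collector.feed('<p><a href="/about">About</a> <a href="/blog">Blog</a></p>')
print(collector.links)  # ['/about', '/blog']
```

A real crawler would resolve these against the page's base URL, de-duplicate, and queue them for fetching; this only shows the extraction step.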
But only certain browsers even give you the functionality to export all that embedded code along with the source of the page. So that's another limitation of scraping: if you're trying to scrape pages that load data from other sites through an embedded plugin, you might not be able to get it. You might come into it and go, oh, it's not there, it can't find it. That's just something else to think about as you're going through this. One other thing to consider, whether you're coming from the testing side or the scraping side, is Selenium IDE. It's a really cool free tool, it's got plugins for just about every browser, and it's got a good little desktop application. What it lets you do is open up your browser, open up Selenium, and literally click through the page. As you click, it records what you're doing, and you can see whether the elements you're working with on your page have IDs and how Selenium sees them. It's essentially going to tell you how hard this project is going to be: how hard will it be to scrape this page? A bonus is that Selenium will let you export that recording in various languages. So if you want a just-brute-force scrape of data, you can use Selenium, have it walk through all the steps, generate that code out, and it will actually do it for you. You can run that in whatever your language is; it supports several, say PHP or Python, and you can watch it open the browser, walk through all of that stuff, do what it needs to do, and then close the browser, hopefully; otherwise, call close browser afterwards. But Selenium is awesome. It really is.
If you're curious, if you're wondering whether this is something you can do, Selenium is probably the fastest way to be able to open a page, get data off of it, and do that in a repetitive fashion. Now, there are some gotchas and things like that, but it really is an excellent starting point, and you're going to see it all over the place. Anybody dealing with scraping in particular is almost always going to mention Python, and they're almost always going to mention Selenium, more so than anything else, because it really is the industry standard for robotic behavior on a web page. Final thoughts? Yeah, all I would say is, for those of you getting into scraping, if you haven't really done scraping before, like Rob said, check out both Selenium and Scrapy and play with them. Heck, go to Develpreneur, throw up Selenium IDE or Scrapy, and try to scrape our page. Oh, one thing we didn't touch on, though: some pages even offer RSS feeds, which are essentially another way to consume a page that we haven't really covered, but it is yet another source. It's essentially another feed you can get, like an API, but it's public. You hit a page, and here's all their data that you can digest, in a clean manner. That's funny, because as soon as you mentioned Develpreneur, I was like, oh, we forgot to mention RSS feeds. For anybody that hasn't used them, you've probably heard of RSS readers and such, but you can go look at the source for an RSS feed, and they tend to be very well formatted documents. It's very easy to crawl through them and get the data you need.
I've worked with many of those over the years, and I've used them instead of a scraper, or even instead of an API. For example, we talked about this a couple of episodes back, and you've seen some of it if you've looked at the Spring Boot side, where I built the little Upwork integrator app that goes out and grabs stuff off of Upwork. I do that from an RSS feed, because Upwork lets me take my specific search and crank it out as an RSS feed. So I can just hit that feed and pull stuff down. It's limited; it's not going to give me everything, I think 30, 40, 50 records at a time, something like that, but I can hit it and go through everything. It's XML, so it's very easy to parse, very easy to pull out the pieces of information I want. And then I don't have to deal with an API, I don't have to connect with that, I don't have to deal with scraping, and it's pretty darn solid. RSS feeds tend to be pretty stable, and they're just going to be out there. So if you wanted to, for example, scrape the articles we've put out on the Develpreneur site, on basically any WordPress site you can go to the site URL slash rss, or rss2, depending on how it's set up, and you'll see a nice XML document that you can parse and do whatever you want with, other than, obviously, repurpose it, unless you tell us, and then we're cool with that. That being said, I think we're going to wrap this one up. As always, shoot us an email at info at develpreneur.com if you have any questions, suggestions, or any feedback; we'd love to hear it. You can check us out on YouTube at Develpreneur. Go out there and have yourselves a great day and a great week.
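The "very easy to parse" claim about RSS holds up because a feed is a small, predictable XML document with a channel/item shape. A minimal sketch with the standard library, using an invented feed that mimics the shape a WordPress feed would have:

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 content standing in for a downloaded feed.
feed = """
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>Scraping vs APIs</title><link>https://example.com/1</link></item>
    <item><title>Selenium IDE basics</title><link>https://example.com/2</link></item>
  </channel>
</rss>
"""

# Every RSS 2.0 feed has one <channel> holding repeated <item> entries,
# so the parsing logic works unchanged across feeds.
channel = ET.fromstring(feed).find("channel")
titles = [item.findtext("title") for item in channel.findall("item")]
print(titles)  # ['Scraping vs APIs', 'Selenium IDE basics']
```

This is the stability the hosts describe: the feed's structure is part of a published format, so the parser doesn't break when the site's visual layout changes.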
And we will talk to you next time. Thank you for listening to Building Better Developers, the Develpreneur podcast. You can subscribe on Apple Podcasts, Stitcher, Amazon, anywhere you can find podcasts; we are there. And remember, just a little bit of effort every day adds up into great momentum and great success.