Detailed Notes
In the latest Develpreneur Podcast episode, hosts Rob and Michael explore data integration methods, focusing on scraping versus using APIs. Drawing on their experience in both realms, they dissect the challenges and advantages of each approach and offer valuable insights for developers and data enthusiasts.
Using Scraping for Data Integration
What is Scraping?
Scraping involves programmatically extracting data from web pages, mimicking human interaction with the user interface. Today, web scraping involves navigating HTML structures, identifying elements by their IDs, and extracting relevant information.
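The idea above can be sketched in a few lines of Python using only the standard library. The HTML fragment and the `first_name`/`email` IDs are hypothetical stand-ins for a real page, which a scraper would fetch over HTTP first:

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would fetch this with
# urllib.request or the requests library before parsing.
PAGE = """
<form>
  <input id="first_name" value="Ada">
  <input id="email" value="ada@example.com">
</form>
"""

class IdValueScraper(HTMLParser):
    """Collects the value attribute of every element that carries an id."""
    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs and "value" in attrs:
            self.values[attrs["id"]] = attrs["value"]

scraper = IdValueScraper()
scraper.feed(PAGE)
print(scraper.values["email"])  # -> ada@example.com
```

Note how the whole approach hinges on elements having stable IDs; without them, a scraper must walk the document structure positionally, which is what makes scraping fragile.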
Inconsistent IDs and Embedded Content
Scraping challenges arise when pages lack consistent IDs or contain embedded content like iframes. On the other hand, APIs provide a structured means of accessing data, offering clear endpoints and formatted responses, typically in JSON or XML.
Streamlining Scraping with Selenium IDE
Rob underscores the importance of developers incorporating IDs into web page elements for easier scraping. He recommends Scrapy and Selenium IDE, useful tools for scripting and recording scraping interactions that provide valuable insight into a page's scrapeability.
Using APIs for Data Integration
What are APIs?
An API is a set of rules for software communication: it defines the methods and data formats for requesting and exchanging information. APIs enable seamless integration between systems by providing structured data access, clear endpoints, and formatted responses. Unlike scraping, APIs follow contractual agreements, which simplifies data retrieval and ensures consistency.
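As a sketch of that structured access, here is how a JSON response might be consumed in Python. The endpoint and field names are hypothetical; a real API's contract defines its own:

```python
import json

# Hypothetical JSON body from a customer endpoint; a real client would
# receive this from an HTTP call via urllib.request or requests.
RESPONSE_BODY = '{"id": 1, "name": "Michael", "email": "michael@example.com"}'

customer = json.loads(RESPONSE_BODY)

# No guesswork, unlike scraping: the contract tells us exactly which
# fields to expect and what they mean.
print(customer["name"], customer["email"])
```

Compare this with walking a DOM for the same data: the API hands back named fields directly, so a page redesign cannot break the integration.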
Controlled Access and Security
Michael highlights the advantages of APIs, emphasizing their controlled access and security features. Unlike scraping, which can be hindered by page changes and inconsistencies, APIs offer a reliable and secure way to access data, with built-in authentication and authorization mechanisms.
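A minimal sketch of that controlled access, assuming a hypothetical endpoint and bearer token; real APIs document their own scheme (API keys, OAuth, and so on):

```python
import urllib.request

# Hypothetical endpoint and token for illustration only.
API_URL = "https://api.example.com/v1/customers/1"
TOKEN = "secret-token"

# The server can reject any request that lacks valid credentials,
# which is the access control scraping cannot offer.
req = urllib.request.Request(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
)

# The request is only constructed here, not sent.
print(req.get_header("Authorization"))  # -> Bearer secret-token
```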
Simplifying Data Retrieval
API contracts define the expected behavior and data format for interacting with an API, making it easier for developers to integrate and consume data. By adhering to these contracts, developers can streamline the data retrieval process and avoid potential errors.
Understanding Endpoints and Parameters
Rob and Michael stress the importance of thoroughly understanding API documentation, which outlines endpoints, request parameters, authentication methods, and response formats. Clear documentation enables developers to effectively use APIs and integrate data into their applications.
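In practice, the documentation's endpoint and parameter names translate directly into a request URL. A small sketch, with a hypothetical endpoint and parameters standing in for whatever the real docs specify:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters; the real names come
# from the API's documentation.
BASE = "https://api.example.com/v1/search"
params = {"q": "customer", "page": 2, "per_page": 50}

url = f"{BASE}?{urlencode(params)}"
print(url)
# -> https://api.example.com/v1/search?q=customer&page=2&per_page=50
```

`urlencode` also handles escaping special characters, one less source of the subtle breakage that hand-built URLs invite.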
Exploring Alternative Data Sources
The Significance of RSS Feeds
An RSS (Really Simple Syndication) feed publishes frequently updated content such as blog posts, news, and podcast episodes. Users subscribe to a website's RSS feed, and new entries are aggregated into a single feed that can be accessed through feed readers or browsers.
RSS Feeds Contain a Lot of Relevant Information
RSS feeds offer easily parsed XML documents, simplifying data extraction compared to scraping or API integration. These feeds include metadata, content summaries, and links, enabling users to stay updated on preferred websites effortlessly.
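Because an RSS feed is plain XML, parsing it needs nothing beyond the Python standard library. A sketch using a hypothetical RSS 2.0 fragment (a real reader would download the feed URL first; the item link below is made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 fragment; a real reader would fetch the feed
# over HTTP, e.g. a WordPress site's /rss or /rss2 path.
FEED = """
<rss version="2.0">
  <channel>
    <title>Develpreneur</title>
    <item>
      <title>Scraping vs APIs</title>
      <link>https://develpreneur.com/example-episode/</link>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(FEED)
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iter("item")
]
print(items[0]["title"])  # -> Scraping vs APIs
```

This is the stability Rob describes in the episode: the feed's structure is fixed by the RSS format, so the parser keeps working even when the site's pages are redesigned.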
In conclusion, Rob and Michael recommend exploring scraping, APIs, and RSS feeds. Consider tools like Scrapy and Selenium for scraping, and familiarize yourself with the various APIs available for data retrieval. With a solid working knowledge of scraping, APIs, and RSS feeds, developers can navigate data integration confidently and efficiently.
Feedback and questions are welcome at [email protected], and listeners are invited to connect with Develpreneur on YouTube for more insights and discussions. By focusing on mastering data integration, developers can unlock new possibilities and streamline their workflows.
Additional Resources

* Restful API Testing With RestAssured - https://develpreneur.com/wp-content/uploads/2022/02/Restful-API-Testing-With-RestAssured.jpg
* Restful API Testing Using RestAssured - https://develpreneur.com/restful-api-testing-using-restassured/
* RSS Reader And Promoter – A Product Walk Through - https://develpreneur.com/rss-reader-and-promoter-a-product-walk-through/
* Scrapy - https://scrapy.org/
* Selenium IDE - https://www.selenium.dev/selenium-ide/
Transcript Text
[Music] similar thing um we've talked about this I think we did this one time at a mentor class way back I think there was a presentation that was and it's basically because I had a conversation today with somebody that brought it up again I was thinking this may be a good good topic to cover is talking about in particularly scraping versus API now this isn't as much of a probably not as much a technical thing although I guess it is some extent because I I I think too often that particularly when you're getting started you'll you know when you're not when you haven't done a lot as a developer you'll hear you know somebody will say hey we need to scrape this data and then you just like okay well it's on a website we got to go to that website we've got to find a way to crawl it and pull that because because that's a cool fun challenging thing to do or because I just did that in my you know my software Class A couple years ago so I sort of know how to do it and it is one of those things like once you know how to do it it's like cool I can do it it's slow it's tedious it's painful depending on how well they use IDs and whether you have to use CSS selectors or xath and all that kind of crud but it's also so fragile and even with um there's some AI projects that are working on that to try to make that less fragile but even with that apis are so much better and there's just like even the idea of like importing and exporting data like just file uploads and you know that kind of stuff that's they are ways to integrate and to move data from system to system that I think people don't think about as much and so that's what I want to talk a bit about it's it's really going to be more like we talk about apis and some of the the the pros and cons of those versus like just go through a scraping thing and stuff like that um I see is that sound of like seem like something You' got some you have some thoughts on as well yeah because incidentally that's also the problem with the web 
automation with the web testing because essentially having to scrape to find the ID is to in uh basically to interact with the using the web driver with the pages and one of the arguments I have not necessarily about screen scraping but about website design in general is good website development requires IDs on your any input fields or any interactive action fields that you have on a page if you're not doing that you're essentially writing bad code and I don't care how much AI you can put out there you're never going to fix the problem you have to write the code correctly or it's just not going to work yeah so I think that's and that's exactly the place one of the things we want to talk about is because that's that's part of why scraping is such a pain in the butt because most code is not written that way sometimes intentionally but I think we can talk a little bit we can dig a little bit into the whole we get a little nerdy on that about getting into a control and how do you how do you navigate to that and pull that information back out so hello and welcome back we are here for another episode of building better developers also known as the developer or podcast actually I think it was it was developer or first and then it became building better developers but it doesn't that's like neither here nor there I'm Rob one of the founders of develop and or and across the uh the digiverse or whatever it is is Michael welcome and I want you introduce yourself as well hey everyone my name is Michael molash co-founder of develop Nur and founder of Envision QA today this episode we are going to talk about we really it's at a high level we're going to talk about integration but we're going to take a step down and talk about like really the differences between scraping versus using apis or other methods there are many ways that you may need to or want to ingest data into a system and I want to talk a little bit about those because there is I too often they sort of just there's 
like a broad brush approach and either everybody thinks that you're just at the most simple of like all we can do is import you know we have to have a CSV and import that and that's the only way we get data in or the other end of the only way we can do this is have this really complex web scraper that goes out and crawls all these sites and grabs all this information and then of course as soon as any that information changes on the site it's broken and we have to rewrite the code neither of those are 100% correct although they might be for you it depends on where you're at so I think what I want to start with is the let's start on the The Far Side of that is the challenges of scraping and in particular uh this is something that's although it is near and dear to my heart because I've had to do this with several projects it's maybe even nearer in Deer or however it is to Michaels because this something that he does as part of his code generation tool and so I think we'll start with that is what are the let's start with what let's talk a little bit what makes a good page for scraping for getting information back off of that page from a program programmatic scent not we're not talking user experience stuff but we're talking the backend side so I guess one additional thing maybe you can touch on too is you know why what is web scraping why why do we use it before I get into that because I want to make sure listeners understand the difference between web scraping and interacting with apis all right I will take that volley back and I'll work with that one so scraping I have heard people that have actually I've come across people that think all of these things are the same so for our purposes we're going to talk about scraping is and I think it's probably the technically correct way that you look at it scraping is when you go to either a a it's basically go to a user interface and you from a program are trying to do the same thing a hum would do it actually comes all the 
way it goes back to the old um main frames and they would screen scrape what they would do is they would have something that would pull the display basically back and it would go you know count out 15 columns over here and three rows down and grab that value and then go six columns over here and two rows up from that and then grab that value so it's literally like looking at your screen in that case the grid that it was because it was a i it's usually like a 40 by 25 grid and it was like where do you go on it to get this specific value and then sometimes it would be you know you grab three blocks in a row to get a value that's a string that you know is you know three characters in length that has advanced into the web world of go to any page the easiest way to like to to get a feel for it is to go to any web page with like pretty much any browser these days rightclick and then hit inspect and you're going to end up getting somewhere there you're going to get an option to like pull out a JavaScript or a view Source kind of thing and if you this is just if you don't know you know if you don't know HTML on that what you're going to see is you're going to see what is a essentially a formatted document there's all of these little tags and there's these ways that you build out that page behind the scenes so if think of that page is all these little controls or these little widgets however you best organize them the goal with scraping is like for example if I'm going to a an input form that has first name last name email phone number and I want to go grab the phone number then I can go look at that document format and I know that if I go to the you know the email input field I can grab that that value off of it or vice versa I can put a value into that field and how you get to that is in itself a little bit of a journey because there are multiple ways to do that there are multiple ways to tackle the the format and how we navigate our way into those pages but that's the 
scraping side API side is and I I I figured I had like an example today that I think probably works the best is that if you want to go look up information about let's say Michael you can go to his website you can go find his about page and then you can go look for stuff that says his name his email phone number all that kind of stuff and you can go find that and then you personally would have to go look at it find it on the screen and then write it down somewhere or if you know him enough you could just text him something and say hey I need your your phone number your email address and he will give you back in a way that you can easily read quickly you know his email and his phone same thing people do that all the times with email you send somebody says hey I need your address they'll give you a nice formatted little ADD address thing that's sort of what an a an API is is that instead of you having to go look through all this stuff you say hey this is the stuff I want this is a data I want and the API gives that back to you in a really nice format usually it's either depending on how you do it it's usually going to be Json or XML or some different things like that depending on what you're doing but the the bonus is it's well formatted and it is it's not guesswork if you're scraping you're trying to figure out is this B data even available in an API it is a there is a contract there of I'm going to give you this information and you're going to give me back these fields and the data that match those now I think I got a little long winded so sorry about that so I'll throw you back to you Michel and let's see if see if you can do and it say anything in a shorter method than I just did no that was really good and thank you for doing that because I I just wanted to make sure that we were kind of on the Baseline one other thing to kind of mentioned here because if you're dealing with apis like Rob said you're kind of dealing with the contract you're dealing with a 
controlled way of passing data from your system to a a requestor a user someone trying to get information from you screen scraping is essentially going out to any site and just basically trying to take the data that is being displayed on a web page and in many cases not many companies want you to go do that uh because they want their data to be protected where which is where apis are a little more powerful they're a little more controlled and you can actually put um restrictions and security on top of them to only allow certain people access to your data whereas screen scraping if it's on the web you can get it uh which is where a lot of your AI tools today are out there just kind of reading a whole bunch of data from so many different pages they're essentially scraping those pages from a developer's perspective if I'm building a site for screen scraping the site needs to follow basic web page or Web building techniques every element on your page that contains data that is necessary for consumption should have an ID maybe a name depending upon which uh you know if you're using uh Json or whatever your tool is but essentially ID is the universal ID it has to be unique for all your elements on your page you should never have two duplicate IDs so if your developers are following best practices and building pages with IDs then you should have no problem you should be able to just say hey go read the Dom read pull down basically or download the web page run a script to find all the IDS on the page boom there you go you can go grab all the data off the page real simple similar to an API but an API you actually make a request you get back Json XML which is very easy to parse um but like I said if you do the API and it returns XML if you're doing a screen scrape and your page has good IDs you essentially could do the same thing it's essentially you're reading an XML you're reading a Dom document and the two could almost be identical if it's done correctly one of the 
biggest problems we have especially from a testing perspective is not a lot of developers are building web pages with IDs in mind they're not thinking about testers they're not thinking about web scrapers they're just trying to get the code after as quickly as possible Sometimes using tools would drag and drop which those are great but those typically will generate a random ID for the page now every time the page gets rebuilt all those IDs change they all randomly change so if you're doing that well yeah your page has IDs but now oh next time I try to run a scraper it all fails because none of the IDS are the same so these are some of the problems and restrictions we have to look at when building Pages we have to think how is this data going to be consumed is it just someone interacting with the web page is it do I need to make sure this is tested do I have a more complex interface where people have to go in and actually use this like um like a patient portal or you know an inventory management system those things should have IDs because one if you're consuming data if you're sending data back to a system to be saved you need to have some identifier some unique identifier and that's how we send information back and forth so we know that oh hey I got this field this field is this data stored in this location if you don't do that how do you know what data you're saving and that's I think that gets us into the one of the big differences between the two when you think about it is if you from an API you can say hey I want let's say I want a customer data I can go in I can ask I can say here's a customer ID give me the information back and it'll say oh for customer ID one here's their name here's their address blah blah blah if you're doing the scrape particularly if you if you've got these random IDs and some of these things or no IDs then what you're doing is you're going to say okay I need to First find a way to get to the search page that and then figure out how do I 
search for that customer and then when I get to that customer then I have to go find somewhere in the document where is the name field where is the address field and if I have IDs then it's really just a matter of you just go find the control with that ID and you go oh the name the name field is ID 46 whatever it is and here's a value and you can pull it out if you don't have that you end up having to walk the Dom basically so you say well if I go to the customer search result page page and I can go to the customer detail then I know that if I get three divs down then I'm into the customer information and then within that one there's two divs over and now I'm into the address information and within that I can go into there's a you know maybe there's an href in there so I can grab a link and then I can do that and so you're having to like you're having to walk the Dom just like you know think about indexing anything you either can walk your way all the way through like if you want to search for a record in database you can look at each record go is it is it is it is it or if you have an ID you can jump right to that row and that's what IDs will give you and so if you want something that is crawlable that is scrapable I guess they're always going to be crawlable which let me talk about that for a second crawling is really just baby step scraping instead of getting into the details and trying to pull data out all you're doing is going into that document and try to find anything in there that would link you to to another page and then go follow that link now you may say well that's a piece of cake it's always an hre no it is not it could be an hre it could be JavaScript it could be a whole bunch of other things it could be and it could be behind the scenes it could be something where it's actually a call back to the server and the server sends something back so there's a lot of different ways that that that can be handled which is why sometimes that's done on purpose 
so that you instead you're driven to the API side and then they say hey we're going to charge you x amount of money so you can subscribe to the API and then you can get going if you haven't looked at these I would or if this is new to you I would go look at scrapey S C A py and I think it's scrape.on I think there's a site with it as well uh that they have and it may be I forget what the other one they called but essentially it's just like so you can go do some scraping it'll help you sort of drag and drop but there's also some python behind it so that you can actually go see what it looks like the other thing I would do is pick just about anything go search for API for whatever your favorite tool is you know whether you're uh I don't know if like you use HubSpot or if you use Google go look at the Google apis or Amazon's apis there's pretty much everybody out there there are apis some are free and some are going to be the much better way to deal with it and then some are not because they may still be a better way to deal with it because you don't have to deal with all the headaches of that we've just talked about but then you also have to pay them for it because they're going to say hey we're going to make it easier for you we're going to make make it nice for you but hey you know share the love and so you're going to pay us whatever you need to pay thoughts on those so those are really good examples now there's one Edge case neither of us have touched on yet and that is embedded code or like iframes or essentially a page within a page if you're trying to do scraping that won't work if you're trying to stream the page and read the data through a stream the only way I've gotten around this is to actually physically uh download the complete contents of the page through like Firefox or Chrome with a plugin uh but only certain browsers even give you the functionality to get all that embed code with your source code of the page when you try to export it so that's 
another limitation with scraping is if you're trying to scrape pages that are actually loading data from other sites through an like an embedded uh plugin you might not be able to get it you might come into it and it's like oh it's not there you can't find it uh and that's just something else to think about as you're going through this uh one other thing to think about too is if you're doing from from the testing side or even if you're doing it from a scraping side uh look at selenium IDE it's a really cool free tool it's got plugins for just about every browser it's got a good desktop uh little application and What it lets you do is it basically you open up your browser open up selenium and you literally click through the page and as you click it records what you're doing and you can see if the elements you're working on on your page have IDs how is selenium seeing it is essentially going to tell you how hard is this project going to be to go scrape this page and a bonus to that is it's lend will while you you can export that in various languages so if you want to scrape if you want to have like the just a Brute Force scrape of data you can actually use selenium have it walk through all the steps generate that code out and it will actually do it for you you can just go run that in whatever your language is it it's got several that it does so let's say you know PHP or python or whatever and it will you can watch it open the browser up walk through all of that stuff do what it needs to do and then you know close the browser hopefully close the browser out otherwise hit close browser afterwards um but selenium is awesome selenium really is if you're if you're curious if you're wondering if this is something you can do then selenium is one that say that is probably the the fastest way to be able to get a page open it get data off of it and and be able to do that in a repetitive fashion now there are some you know there's some gotas to it and things like that but that 
really is like a excellent starting point for you and that's where you're going to see it all over the place anywhere that's anybody that's dealing with scraping in particular they're almost always going to mention Python and they're almost always going to mention selenium and selenium more so more so than anything else because that really is sort of the the industry standard I think for for like robotic behavior on a web page final thoughts yeah all I would like to say is for those of you getting into scraping if you haven't really done scraping before um like Rob said check out both the selenium scrap SC Scrappy Scrapy um and play with them uh it there's heck go to developing ner you know throw up uh sending ID or scrapey try to scrape our page some oh one thing we didn't touch on though is the other thing is some pages even offer RSS feeds which essentially are another way to scrape a page that we haven't really touched on but it is yet another source it's essentially another feeder that you can get like an API but it's public so you essentially hit a page here's all their data that you can digest and it's in a clean manner that's funny because as soon as you mention develop I like oh we forgot to mention RSS feeds RSS feeds just for those you guys that anybody that hasn't you probably have heard of RSS readers and such but you can go see the source for an RSS feed and they tend to be very well they're they're formatted documents they are it's very easy to crawl through them and get the data you need there are I've worked with many of those over the years um and used those actually as opposed to a scraper or even as opposed to an API because then for example we talked about I think it was like a couple episodes back now we talked about the uh and you guys have seen some of that if you if you've looked on the python side or actually I'm sorry it's the spring boot side where I built the um I think I talked about there anyways built the built the upwork integrator 
little app that goes out and grabs stuff off of upwork well I do that from an RSS feed because I can then take my specific search and it allows you to crank that out as an RSS feed and so then I can just hit that RSS feed pull stuff down and it's limited it's not going to give me everything it gives me I think 30 40 50 records at a time at something but I can hit that I can go through everything it's it's XML so it's very easy to parse very easy to pull out the pieces of information I want and then I don't have to deal with an API I don't have to connect with that I don't have to deal with scraping and it's pretty darn solid I mean RSS feeds you tend to be pretty stable and they're just going to be out there so if you wanted to for example have like be scraping the articles that we have put out on the developing order site you can go to and basically any WordPress like go to the site RSS or rss2 depending on how it is and then you can see a nice XML document that you can parse and do what you you know do whatever you want to with it other than obviously repurpose it unless you tell us and then then we're cool about that that being said I think we're going to wrap this one up so as always shoot us email at info@ developin word.com if you have any questions if you have suggestions if you have any anything like that any feedback we'd love to hear it you can check us out on YouTube at develop and or and just go out there and have yourself a great day a great week and we will talk to you next time all right thoughts that seem to go pretty well too I thought yeah no I like that in fact it gave me an idea and I'm going to throw it out here so when I'm listening to this again I'll remember to do it uh I'm going to go ahead and throw the links out there for uh Scrapy um kind of like RS there there's a plugin for that uh one for slum IDE and then what I may even do is I may even throw out a simple python script to just uh screen scrape a page yeah yeah if you need that I've 
got like a thousand of those so so we do that we really we we do that so often it's it's it's sort of funny as it's now it's become just a standard well that's that's what the guy I was talking to today it's a new customer and he's like hey can we you know he wants some data and it's like can we do that like yeah people do that all the time literally like we' have done that for all kinds of different verticals and yeah it's just we'll go grab it and yeah it's not if you're scraping it that's the only thing is it's not 100% it's that you'll get some weird stuff even we didn't even get into this we had something where we were it was you pull down a PDF and you would think pull down a PDF like XML or CSV or anything else like just pull it down scrape it and you're good PDFs even it was was amazing it was obviously the same uh generation engine it was a report it was a nice little PDF report that we were trying to find the data on and it was it would periodically it would just be totally off it was just we could we finally got I think we got it to like better than 98% hit rate but it would be stuff where it was just like it's like NOP it's they added another column there and it's like there shouldn't be but they just add another column there and it would be because of uh for example if somebody put Extra Spaces somewhere it would just decide that nope that's now another field that's another column and you really couldn't see it in the like if you looked at the PDF it looked fine you had to actually look at the the document behind the scenes you had to look at the Dom for that PDF to realize that oh they actually you know threw another column in there that's empty that has no value but that's kind of crap that we you know that you can run into yet another play way that you can get into scraping and integrating data and stuff like that we've we've done it all it feels like the last few years it's I that one what uh over a decade ago uh back when almost American old 
patient days but uh there was a guy there was a site out there for a while that had all the old uh tabletop documents for like old school Dungeons and Dragons BattleTech um Mech Warrior like all the old stuff that you can't get in anymore that's all discontinued and he had it all online all basically it was kind of like a um Dropbox but this was kind of before Dropbox was really popular and I just wor a little scraper that went out literally and it found all the links and it downloaded all the documents in the same file structure and I was able to uh kind of parse through it and find the ones I was looking for but I I couldn't find them through the site so I was like well here I'll just get them all and I'll figure it out later and uh yeah those are things you can do with these scrapers it's not just scraping data but you can actually download files you can interact with things you can pull in videos you can pull in images uh you know lots of things you can do with it I actually did that and this is bonus material so there you go um I had a customer that we Built Well I built out a site for her and we had I don't know 20 30 different pages and screens and at the end of the day she wanted to have a a picture she wanted a screenshot of every single screen as part of the the user documentation and I was like oh my gosh that's going to take a while but what I ended up doing is a screen scraper that just walked through everything and it was easy enough because I knew all the links I just I I just crawled the site initially and said boom take a every just go to each page snap a picture go to a page snap a picture and I think I had like three others that was like yeah I just went in and manually and just said added those couple of links and said go there and snap it and then suddenly had you know whatever it was the 40 or 50 images that she needed simple things like that it's like you it's not till you think about it and you're like oh wait you sneeze and then you go hey 
that's actually a better way to do it than to manually go through it. It's just one of those things, once you've done it a couple times you look at some of these things and it's like, ah, I really don't want to do that a hundred times, but maybe I can spend a few minutes, and especially with some of these tools, I can spend a little bit, generate something that I can just replicate over and over again, and have it do the work for me instead of me going through that drudgery. Now it's funny, uh, that you mentioned that, because on the testing side of things, uh, what you can do, especially with the Selenium WebDriver, is you can walk through a sitemap and you can actually, uh, take a screenshot through WebDriver. So, like, if you want to do a screenshot in Chrome or Firefox, you essentially load the driver for that browser and then do a snapshot of the page that you're on. So if you're actually doing, like, uh, browser comparisons to see how your site looks on each browser, uh, what you can do is you do that, and then you do an image compare, and it will tell you, oh hey, this image does not match these. It's kind of cool, it's very simple, there's good examples out there for that, but that's just another way that you can, uh, do that with some online tools. That's a neat one. We did that, and I'll say what this customer was doing was a little bit shady, but it's legal, it's borderline stuff. And that was one of the things they did. This was, um, there was a scraper, an automation tool, and it was very sensitive to certain themes and some things like that, and so what they did is they had, uh, a crawler go through, and it would just walk through this and take a picture of each of the themes, so then later they could go and say, am I on this page, and they could actually completely compare that image, as opposed to, you know, trying to look for other stuff, because they would, they
would look for, I guess, certain tags, and those would change, things would move, the formats would change. So instead, if it looked the same, basically, it was enough that they'd say, okay, if this is a, you know, 80% match, then we're basically on the right page; if it's not, then we're, you know, on the wrong page. And they did that even with buttons, so instead of trying to find the button control, they would go see if that image existed somewhere in the mapping, and then they would use that to figure out what the coordinates were, so they could press the button that way. I mean, there's some stuff like that, because at first I was like, why do you have 18 pictures of the same button, and they're all just a little bit different? And it turned out it had to do with, um, the way they were spinning stuff up, and it had to do with the display resolution, and then also some of the default color themes would show up. And so, yeah, it's funny, that kind of stuff that you don't think about, like, why would you need to take pictures as you're walking through a site? But there are a lot of uses for such things. Well, it's even interesting now too, the last thing you mentioned there, they were taking the pictures, you know, to see what was there. Well, now, like, uh, Amazon has, it's not Polly, but they've got one of their libraries now where you can, uh, basically look for text on an image and then scrape the text from the image as well. So not only can you, you know, scrape web pages, but you can also scrape images on a page as well. That's true, that's one that I've seen some of in recent years. I don't know how well that works, but I've seen a couple of places that use it, and it seems to work for them pretty well. And it probably is, uh, yeah, same thing, Polly I think is the text-to-speech, but it's something like that. I was amazed at, um, some of the image processing
stuff, and this was, gosh, like three or four years ago when I was going through all the Amazon services. They had just opened a couple of those up, and it was built, uh, really was built initially for augmented and virtual reality stuff, but what it was saying was, so I could take a picture, you'd have a series of pictures, and you say, show me the parrots, and it would go find the parrots in each of those. And it's not a picture of a parrot, I would just, like, say show me parrots and it would pull those out. Things like that, it's just amazing the image processing stuff that they can do and how well they can actually search those now, and that will probably end up being the next, you know, round of scraping, is like, hey, I want to, you know, grab some of the text off of these images. Well, the thing, uh, Apple's got, I think Google does too, I know we're getting a little long here, but, uh, you can also translate text. So I've actually been watching some shows, and it's like, oh, what's that say? And I take a picture of it, and then I highlight the text, and I say translate, and it'll translate it. It works for most languages, um, but, I mean, it's amazing where we're going with technology. Well, I got introduced to that, gosh, now it's probably 10 or 15 years ago, because Google Translate was like, you can slap that on any web page, and there was something I was doing for somebody, I don't remember who the customer was, but they wanted it, they needed it in like six different languages. I was like, I don't know six different languages, I mean, programming languages, great, but spoken languages, no, I don't know if I can do this. And we were talking through some stuff, and they were going to go hire some people and rewrite, you know, just extract everything out into strings and then convert, you know, convert all of it and translate it all. And I ended up playing around with the Google Translate, and you just put a little button there, and it'll just, like, you just
tell it what language and, boom, it'll give you that language. It was two lines of code, and it wasn't perfect, because it was, you know, just brute-force translation, but it was close enough for the languages they were using. They're like, yeah, that works, that's what we need, you know, that's what we need. And then we said, okay, if there's a language that you need, you really want it, and it doesn't translate, then we can always go back and we can create that specific page for that language. There's some things we can do. But stuff like that, that gets you that 80/20 rule. If it can get you 80% of the way there or better, then why not? I mean, some of that stuff, it's almost free to use, or it is, so why not just take that? Exactly. All right, I think we can call it a wrap. We got a couple episodes done, we'll come back next week, we'll do it on Thursday, and, uh, we'll just keep chugging along. Sounds good. Thanks again, Rob, for moving around and dealing with the weather and that. Um, we're supposed to get the really bad storms tomorrow, so, uh, I didn't think I wanted to take a chance of losing the internet during that. That's a good point, and, yeah, it's sort of frustrating when we do. So to the rest of you guys, goodbye, have a good one. And yourself. We'll talk to you again next time, and we'll just keep chugging along, and I'm sure we'll have plenty to rant about next week as well. [Music]
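The "snap a picture of every page" trick described above boils down to a very small loop. Here is a minimal sketch; `driver` can be any object exposing `.get(url)` and `.save_screenshot(path)`, which a real `selenium.webdriver.Chrome()` or `Firefox()` instance does, and the URL list is a hypothetical stand-in for a crawled sitemap:

```python
# Sketch: walk a list of pages and screenshot each one.
# `driver` is assumed to provide .get(url) and .save_screenshot(path),
# as Selenium WebDriver instances do. URLs here are placeholders.

def screenshot_site(driver, urls, prefix="screen"):
    saved = []
    for i, url in enumerate(urls, start=1):
        driver.get(url)                    # load the page
        path = f"{prefix}_{i:02d}.png"
        driver.save_screenshot(path)       # snap it to disk
        saved.append(path)
    return saved
```

Pointed at the site's full link list, a loop like this produces the 40 or 50 user-documentation screenshots in one unattended run.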
Transcript Segments
[Music]
similar thing um we've talked about this
I think we did this one time at a mentor
class way back I think there was a
presentation that was and it's basically
because I had a conversation today with
somebody that brought it up again I was
thinking this may be a good good topic
to cover is talking about in
particularly scraping versus API now
this isn't as much of a probably not as
much a technical thing although I guess
it is some extent because I I I think
too often that particularly when you're
getting started you'll you know when
you're not when you haven't done a lot
as a
developer you'll hear you know somebody
will say hey we need to scrape this data
and then you just like okay well it's on
a website we got to go to that website
we've got to find a way to crawl it and
pull that because because that's a cool
fun challenging thing to do or because I
just did that in my you know my software
Class A couple years ago so I sort of
know how to do it and it is one of those
things like once you know how to do it
it's like cool I can do it it's slow
it's tedious it's painful depending on
how well they use IDs and whether you
have to use CSS selectors or XPath and
all that kind of crud but it's also so
fragile and even with um there's some AI
projects that are working on that to try
to make that less fragile but even with
that apis are so much better and there's
just like even the idea of like
importing and exporting data like just
file uploads and you know that kind of
stuff that's they are ways to integrate
and to move data from system to system
that I think people don't think about as
much and so that's what I want to talk a
bit about it's it's really going to be
more like we talk about apis and some of
the the the pros and cons of those
versus like just go through a scraping
thing and stuff like that um does
that sound like something
you've got some you have some thoughts on
as well yeah
because incidentally that's also the
problem with the web automation with the
web testing because essentially having
to scrape to find the IDs to, uh,
basically interact with the pages using
the web driver and one of the
arguments I have not necessarily about
screen scraping but about website design
in general is good website development
requires IDs on your any input fields or
any interactive action fields that you
have on a page if you're not doing that
you're essentially writing bad code and
I don't care how much AI you can put out
there you're never going to fix the
problem you have to write the code
correctly or it's just not going to work
yeah so I think that's and that's
exactly the place one of the things we
want to talk about is because that's
that's part of why scraping is such a
pain in the butt because most code is
not written that way sometimes
intentionally but I think we can talk a
little bit we can dig a little bit into
the
whole we get a little nerdy on that
about getting into a control and how do
you how do you navigate to that and pull
that information back out
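The "getting into a control and pulling that information back out" problem they tease here comes down to keyed lookup versus positional DOM walking. A minimal sketch using only the standard library's ElementTree on a made-up page (ElementTree wants strict XML, so real-world HTML usually needs an HTML-aware parser; the page and IDs below are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page with IDs on its input fields.
PAGE = """<html><body>
  <div><div><input id="first_name" value="Ada"/></div></div>
  <div><div><input id="email" value="ada@example.com"/></div></div>
</body></html>"""

root = ET.fromstring(PAGE)

# With an ID: jump straight to the control, like a keyed database lookup.
email = root.find(".//*[@id='email']").get("value")

# Without IDs: walk the DOM positionally -- second div, first child, first child.
# This breaks the moment anyone reorders or wraps the markup.
body = root.find("body")
email_fragile = body[1][0][0].get("value")
```

Both lookups return the same value today, but only the ID-based one survives a page redesign, which is exactly the point made throughout this episode.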
so hello and welcome back we are here
for another episode of building better
developers also known as the Develpreneur
podcast actually I think it was
Develpreneur first and then it
became building better developers but it
doesn't that's like neither here nor
there I'm Rob one of the founders of
Develpreneur and across the uh the
digiverse or whatever it is is Michael
welcome and I want you to introduce
yourself as well
hey everyone my name is Michael Meloche
co-founder of Develpreneur and founder of
Envision
QA today this episode we are going to
talk about we really it's at a high
level we're going to talk about
integration but we're going to take a
step down and talk about like really the
differences between scraping versus
using apis or other methods there are
many ways that you may need to or want
to ingest data into a system and I want
to talk a little bit about those because
there is I too often they sort of just
there's like a broad brush approach and
either everybody thinks that you're just
at the most simple of like all we can do
is import you know we have to have a CSV
and import that and that's the only way
we get data in or the other end of the
only way we can do this is have this
really complex web scraper that goes out
and crawls all these sites and grabs all
this information and then of course as
soon as any that information changes on
the site it's broken and we have to
rewrite the code
neither of those are 100% correct
although they might be for you it
depends on where you're
at so I think what I want to start with
is the let's start on the The Far Side
of that is the challenges of scraping
and in particular uh this is something
that's although it is near and dear to
my heart because I've had to do this
with several projects it's maybe even
nearer and dearer or however it is to
Michael's because this is something that he
does as part of his
code generation tool and so I think
we'll start with that is what are the
let's start with what let's talk a
little bit what makes a good page for
scraping for getting information back
off of that page from a program
programmatic sense not we're not talking
user experience stuff but we're talking
the backend
side so I guess one additional thing
maybe you can touch on too is you know
why what is web scraping why why do we
use it before I get into that because I
want to make sure listeners understand
the difference between web scraping and
interacting with
apis all right I will take that volley
back and I'll work with that one so
scraping I have heard people that have
actually I've come across people that
think all of these things are the same
so for our purposes we're going to talk
about scraping is and I think it's
probably the technically correct way
that you look at it scraping is when you
go to either a a it's basically go to a
user interface and you from a program
are trying to do the same thing a human
would do it actually comes all the way
it goes back to the old um main frames
and they would screen scrape what they
would do is they would have something
that would pull the display basically
back and it would go you know count out
15 columns over here and three rows down
and grab that value and then go six
columns over here and two rows up from
that and then grab that value so it's
literally like looking at your screen in
that case the grid that it was because
it was, I mean it's usually like a 40 by 25
grid and it was like where do you go on
it to get this specific value and then
sometimes it would be you know you grab
three blocks in a row to get a value
that's a string that you know is you
know three characters in length that has
advanced into the web world of go to any
page the easiest way to like to to get a
feel for it is to go to any web page
with like pretty much any browser these
days rightclick and then hit inspect and
you're going to end up getting somewhere
there you're going to get an option to
like pull out a JavaScript or a view
Source kind of thing and if you this is
just if you don't know you know if you
don't know HTML on that what you're
going to see is you're going to see what
is a essentially a formatted document
there's all of these little tags and
there's these ways that you build out
that page behind the scenes so if think
of that page is all these little
controls or these little widgets however
you best organize them the goal with
scraping is like for example if I'm
going to a an input form that has first
name last name email phone number and I
want to go grab the phone number then I
can go look at that document format and
I know that if I go to the you know the
email input field I can grab that that
value off of it or vice versa I can put
a value into that
field and how you get to
that is in itself a little bit of a
journey because there are multiple ways
to do that there are multiple ways to
tackle the the format and how we
navigate our way into those pages but
that's the scraping side API side is and
I I I figured I had like an example
today that I think probably works the
best is that if you want to go look up
information about let's say Michael you
can go to his website you can go find
his about page and then you can go look
for stuff that says his name his email
phone number all that kind of stuff and
you can go find that and then you
personally would have to go look at it
find it on the screen and then write it
down somewhere or if you know him enough
you could just text him something and
say hey I need your your phone number
your email address and he will give you
back in a way that you can easily read
quickly you know his email and his phone
same thing people do that all the times
with email you send somebody says hey I
need your address they'll give you a
nice formatted little address thing
that's sort of what an a an API is is
that instead of you having to go look
through all this stuff you say hey this
is the stuff I want this is a data I
want and the API gives that back to you
in a really nice format usually it's
either depending on how you do it it's
usually going to be Json or XML or some
different things like that depending on
what you're doing but the the bonus is
it's well formatted and it is it's not
guesswork if you're scraping you're
trying to figure out is this data even
available in an API it is a there is a
contract there of I'm going to give you
this information and you're going to
give me back these fields and the data
that match those now I think I got a
little long winded so sorry about that
so I'll throw it back to you Michael and
let's see if see if you can do and it
say anything in a shorter method than I
just did no that was really good and
thank you for doing that because I I
just wanted to make sure that we were
kind of on the Baseline one other thing
to kind of mentioned here because if
you're dealing with apis like Rob said
you're kind of dealing with the contract
you're dealing with a controlled way of
passing data from your system to a a
requestor a user someone trying to get
information from you screen scraping is
essentially going out to any site and
just basically trying to take the data
that is being displayed on a web
page and in many cases
companies don't want you to go do
that uh because they want their data to
be protected where which is where apis
are a little more powerful they're a
little more controlled and you can
actually put um restrictions and
security on top of them to only allow
certain people access to your data
whereas screen scraping if it's on the
web you can get it uh which is where a
lot of your AI tools today are out there
just kind of reading a whole bunch of
data from so many different pages
they're essentially scraping those pages
from a developer's perspective if I'm
building a site for screen scraping the
site needs to follow basic web page or
Web building techniques every element on
your page that contains data that is
necessary for consumption should have an
ID maybe a name depending upon which uh
you know if you're using uh Json or
whatever your tool is but essentially ID
is the universal ID it has to be unique
for all your elements on your page you
should never have two duplicate
IDs so if your developers are following
best practices and building pages with
IDs then you should have no problem you
should be able to just say hey go read
the Dom read pull down basically or
download the web page run a script to
find all the IDS on the page boom there
you go you can go grab all the data off
the page real simple similar to an API
but an API you actually make a request
you get back Json XML which is very easy
to parse um but like I said if you do
the API and it returns XML if you're
doing a screen scrape and your page has
good IDs you essentially could do the
same thing it's essentially you're
reading an XML you're reading a Dom
document and the two could almost be
identical if it's done correctly one of
the biggest problems we have especially
from a testing perspective is not a lot
of developers are building web pages
with IDs in mind they're not thinking
about testers they're not thinking about
web scrapers they're just trying to get
the code out as quickly as possible
sometimes using tools with drag and
drop which those are great but those
typically will generate a random ID for
the page now every time the page gets
rebuilt all those IDs change they all
randomly change so if you're doing that
well yeah your page has IDs but now oh
next time I try to run a scraper it all
fails because none of the IDS are the
same so these are some of the problems
and restrictions we have to look at when
building Pages we have to think how is
this data going to be consumed is it
just someone interacting with the web
page is it do I need to make sure this
is tested do I have a more complex
interface where people have to go in and
actually use this like um like a patient
portal or you know an inventory
management system those things should
have IDs because one if you're consuming
data if you're sending data back to a
system to be saved you need to have some
identifier some unique identifier and
that's how we send information back and
forth so we know that oh hey I got this
field this field is this data stored in
this location if you don't do that how
do you know what data you're
saving and that's I think that gets us
into the one of the big differences
between the two when you think about it
is if you from an API you can say hey I
want let's say I want a customer data I
can go in I can ask I can say here's a
customer ID give me the information back
and it'll say oh for customer ID one
here's their name here's their address
blah blah blah if you're doing the
scrape particularly if you if you've got
these random IDs and some of these
things or no IDs then what you're doing
is you're going to say okay I need to
First find a way to get to the search
page that and then figure out how do I
search for that customer and then when I
get to that customer then I have to go
find somewhere in the document where is
the name field where is the address
field and if I have IDs then it's really
just a matter of you just go find the
control with that ID and you go oh the
name the name field is ID 46 whatever it
is and here's a value and you can pull
it
out if you don't have that you end up
having to walk the Dom basically so you
say well if I go to the customer search
result page and I can go to the
customer detail then I know that if I
get three divs down then I'm into the
customer information and then within
that one there's two divs over and now
I'm into the address information and
within that I can go into there's a you
know maybe there's an href in there so I
can grab a link and then I can do that
and so you're having to like you're
having to walk the Dom just like you
know think about indexing anything you
either can walk your way all the way
through like if you want to search for a
record in database you can look at each
record go is it is it is it is it or if
you have an ID you can jump right to
that row and that's what IDs will give
you and so if you want something that is
crawlable that is scrapable I guess
they're always going to be crawlable
which let me talk about that for a
second crawling is really just baby step
scraping instead of getting into the
details and trying to pull data out all
you're doing is going into that document
and try to find anything in there that
would link you to to another page and
then go follow that link now you may say
well that's a piece of cake it's always
an href no it is not it could be an href
it could be JavaScript it could be a
whole bunch of other things it could be
and it could be behind the scenes it
could be something where it's actually a
call back to the server and the server
sends something back so there's a lot of
different ways that that that can be
handled which is why sometimes that's
done on purpose so that you instead
you're driven to the API side and then
they say hey we're going to charge you x
amount of money so you can subscribe to
the API and then you can get going if
you haven't looked at these I would or
if this is new to you I would go look at
Scrapy S C R A P Y and I think it's
scrapy.org I think there's a site with it
as well uh that they have and it may be
I forget what the other one they called
but essentially it's just like so you
can go do some scraping it'll help you
sort of drag and drop but there's also
some python behind it so that you can
actually go see what it looks like the
other thing I would do is pick just
about anything go search for API for
whatever your favorite tool is you know
whether you're uh I don't know if like
you use HubSpot or if you use Google go
look at the Google apis or Amazon's apis
there's pretty much everybody out there
there are apis some are free and some
are going to be the much better way to
deal with it and then some are not
because they may still be a better way
to deal with it because you don't have
to deal with all the headaches of that
we've just talked about but then you
also have to pay them for it because
they're going to say hey we're going to
make it easier for you we're going to
make make it nice for you but hey you
know share the love and so you're going
to pay us whatever you need to pay
thoughts on
those so those are really good examples
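Whichever vendor's API you end up subscribing to, the consuming side usually reduces to an HTTP GET plus JSON parsing, which is what makes the "contract" so much nicer than scraping. A sketch against a made-up customer endpoint; the URL shape and field names are assumptions, not any real service's API:

```python
import json

# In real use you'd fetch over HTTP first, e.g.:
#   from urllib.request import urlopen
#   payload = urlopen(f"{base_url}/customers/{customer_id}").read()
# Here we parse a canned response to show the contract side.

def parse_customer(payload):
    """Pull the agreed-upon fields out of the API's JSON response."""
    data = json.loads(payload)
    return {"id": data["id"], "name": data["name"], "address": data["address"]}

payload = '{"id": 1, "name": "Ada Lovelace", "address": "12 Example St", "extra": true}'
customer = parse_customer(payload)
# customer -> {'id': 1, 'name': 'Ada Lovelace', 'address': '12 Example St'}
```

There is no DOM walking and no guessing: the endpoint promises those fields, so the consumer just reads them.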
now there's one Edge case neither of us
have touched on yet and that is
embedded code or like iframes or
essentially a page within a page if
you're trying to do scraping that won't
work if you're trying to stream the page
and read the data through a stream the
only way I've gotten around this is to
actually physically uh download the
complete contents of the page through
like Firefox or Chrome with a plugin uh
but only certain browsers even give you
the functionality to get all that embed
code with your source code of the page
when you try to export it so that's
another limitation with scraping is if
you're trying to scrape pages that are
actually loading data from other sites
through an like an embedded uh plugin
you might not be able to get it you
might come into it and it's like oh it's
not there you can't find it uh and
that's just something else to think
about as you're going through this uh
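One standard-library way around the embedded-content problem is to pull each iframe's `src` out of the outer page and fetch those documents separately (with Selenium you would instead call `driver.switch_to.frame(...)`, which is its API for entering an embedded frame). A sketch; note that iframes injected later by JavaScript still won't show up in the static source:

```python
from html.parser import HTMLParser

class IframeFinder(HTMLParser):
    """Collect the src of every <iframe> so each embedded page
    can be fetched and scraped on its own."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

finder = IframeFinder()
finder.feed('<div><iframe src="https://example.com/embed"></iframe></div>')
# finder.sources now lists the embedded page URLs to scrape separately
```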
one other thing to think about too
is if you're doing from from the testing
side or even if you're doing it from a
scraping side uh look at selenium IDE
it's a really cool free tool it's got
plugins for just about every browser
it's got a good desktop uh little
application and What it lets you do is
it basically you open up your browser
open up selenium and you literally click
through the page and as you click it
records what you're doing and you can
see if the elements you're working on on
your page have IDs how is selenium
seeing it is essentially going to tell
you how hard is this project going to be
to go scrape this page and a bonus to
that is Selenium will, well, you can
export that in various languages so if
you want to scrape if you want to have
like the just a Brute Force scrape of
data you can actually use selenium have
it walk through all the steps generate
that code out and it will actually do it
for you you can just go run that in
whatever your language is it it's got
several that it does so let's say you
know PHP or python or whatever and it
will you can watch it open the browser
up walk through all of that stuff do
what it needs to do and then you know
close the browser hopefully close the
browser out otherwise hit close browser
afterwards um but selenium is awesome
selenium really is if you're if you're
curious if you're wondering if this is
something you can do then selenium is
one that say that is probably the the
fastest way to be able to get a page
open it get data off of it and and be
able to do that in a repetitive fashion
now there are some you know there's some
gotchas to it and things like that but
that really is like a excellent starting
point for you and that's where you're
going to see it all over the place
anywhere that's anybody that's dealing
with scraping in particular they're
almost always going to mention Python
and they're almost always going to
mention selenium and selenium more so
more so than anything else because that
really is sort of the the industry
standard I think for for like robotic
behavior on a web page final
thoughts yeah all I would like to say is
for those of you getting into scraping
if you haven't really done scraping
before um like Rob said check out both
the Selenium and Scrapy S C R A P Y um
and play with them uh it there's heck go
to develpreneur.com you know fire up uh
Selenium IDE or Scrapy try to scrape our
page some oh one thing we didn't touch
on though is the other thing is some
pages even offer RSS feeds which
essentially are another way to scrape a
page that we haven't really touched on
but it is yet another source it's
essentially another feeder that you can
get like an API but it's public so you
essentially hit a page here's all their
data that you can digest and it's in a
clean manner that's funny because as
soon as you mention Develpreneur I'm like oh we
forgot to mention RSS feeds RSS feeds
just for those you guys that anybody
that hasn't you probably have heard of
RSS readers and such but you can go see
the source for an RSS feed and they tend
to be very well they're they're
formatted documents they are it's very
easy to crawl through them and get the
data you need there are I've worked with
many of those over the years um and used
those actually as opposed to a scraper
or even as opposed to an API because
then for example we talked about I think
it was like a couple episodes back now
we talked about
the uh and you guys have seen some of
that if you if you've looked on the
python side or actually I'm sorry it's
the spring boot side where I built the
um I think I talked about there anyways
built the built the upwork integrator
little app that goes out and grabs stuff
off of upwork well I do that from an RSS
feed because I can then take my specific
search and it allows you to crank that
out as an RSS feed and so then I can
just hit that RSS feed pull stuff down
and it's limited it's not going to give
me everything it gives me I think 30 40
50 records at a time at something but I
can hit that I can go through everything
it's it's XML so it's very easy to parse
very easy to pull out the pieces of
information I want and then I don't have
to deal with an API I don't have to
connect with that I don't have to deal
with scraping and it's pretty darn solid
I mean RSS feeds you tend to be pretty
stable and they're just going to be out
there so if you wanted to for example
have like be scraping the articles that
we have put out on the Develpreneur
site you can go to and basically any
WordPress like go to the site RSS or
rss2 depending on how it is and then you
can see a nice XML document that you can
parse and do what you you know do
whatever you want to with it other than
obviously repurpose it unless you tell
us and then then we're cool about that
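Because an RSS feed (like any WordPress site's /rss2 URL) is plain XML, the standard library parses it directly with no scraping at all. A sketch on a canned feed; fetching a live feed URL is left as the commented line:

```python
import xml.etree.ElementTree as ET

# Canned RSS 2.0 feed standing in for a real one. In real use:
#   from urllib.request import urlopen
#   RSS = urlopen("https://example.com/rss2").read()
RSS = """<rss version="2.0"><channel>
  <title>Sample Feed</title>
  <item><title>Post One</title><link>https://example.com/1</link></item>
  <item><title>Post Two</title><link>https://example.com/2</link></item>
</channel></rss>"""

channel = ET.fromstring(RSS).find("channel")
posts = [(item.findtext("title"), item.findtext("link"))
         for item in channel.findall("item")]
```

Each `<item>` yields a title and link with no fragile DOM walking, which is why a feed can stand in for both a scraper and a full API integration.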
that being said I think we're going to
wrap this one up so as always shoot us
email at info@develpreneur.com if you
have any questions if you have
suggestions if you have any anything
like that any feedback we'd love to hear
it you can check us out on YouTube at
Develpreneur and just go out there and
have yourself a great day a great week
and we will talk to you next
time all right thoughts that seem to go
pretty well too I thought yeah no I like
that in fact it gave me an idea and I'm
going to throw it out here so when I'm
listening to this again I'll remember to
do it uh I'm going to go ahead and throw
the links out there for uh
Scrapy um kind of like RSS
there there's a plugin for that uh one
for Selenium IDE and then what I may even do
is I may even throw out a simple python
script to just uh screen scrape a page
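In that spirit, here is a minimal, admittedly naive screen scrape using only the standard library: it grabs the text of any tag that carries an id. It's a sketch, not the script promised on the show, and it would trip over nested tags inside an id'd element; real work would reach for Scrapy or Selenium as discussed:

```python
from html.parser import HTMLParser

class IdScraper(HTMLParser):
    """Record the text content of any tag that has an id attribute."""
    def __init__(self):
        super().__init__()
        self._current = None   # id of the tag we just opened, if any
        self.values = {}

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("id")

    def handle_data(self, data):
        if self._current and data.strip():
            self.values[self._current] = data.strip()
            self._current = None

# A toy page; in real use the HTML would come from urllib.request.urlopen(url).
page = '<html><body><span id="name">Rob</span> <span id="phone">555-0100</span></body></html>'
s = IdScraper()
s.feed(page)
# s.values -> {'name': 'Rob', 'phone': '555-0100'}
```

If the page's developers used good, stable IDs, this is all it takes; without them you're back to walking the DOM positionally.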
Yeah, if you need that, I've got like a thousand of those. We do that so often it's sort of funny; it's become just standard. That's what the guy I was talking to today, a new customer, asked. He wants some data, and he's like, "Can we do that?" And it's like, yeah, people do that all the time. We've literally done it for all kinds of different verticals. We'll just go grab it.

The only thing with scraping is that it's not 100%; you'll get some weird stuff. We didn't even get into this, but we had a project where you'd pull down a PDF. You would think: pull down a PDF like you would XML or CSV or anything else, just pull it down, scrape it, and you're good. But even with PDFs it was amazing. It was obviously the same generation engine, a nice little PDF report we were trying to pull the data from, and periodically it would just be totally off. I think we finally got it to better than a 98% hit rate, but there would be cases where, nope, they added another column. There shouldn't have been another column, but if somebody put extra spaces somewhere, the parser would decide that's now another field, another column. And you really couldn't see it: if you looked at the PDF, it looked fine. You had to actually look at the document behind the scenes, the DOM for that PDF, to realize they'd thrown in another column that was empty, that had no value. That's the kind of crap you can run into.
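The failure mode in that PDF anecdote can be illustrated. This is a hedged sketch: the report layout, column count, and sample rows are invented, but the mechanism, a stray run of spaces being mistaken for a column separator, is the one described, along with one defensive fix (flag the row instead of silently shifting fields).

```python
import re

EXPECTED_COLUMNS = 3  # the known layout of the (hypothetical) report

def split_columns(line):
    """Treat runs of 2+ spaces as column separators, a common approach
    for text extracted from a fixed-layout PDF report."""
    return re.split(r" {2,}", line.strip())

def parse_row(line):
    """Return the fields, or None if the column count is wrong."""
    fields = split_columns(line)
    if len(fields) != EXPECTED_COLUMNS:
        # A stray double space (or an invisible empty column in the
        # PDF's internal structure) has shifted the fields.
        # Flag the row for review rather than guessing.
        return None
    return fields

good = "ACME Corp  2023-04-01  1,250.00"
bad  = "ACME  Corp  2023-04-01  1,250.00"  # one extra space: 4 fields
print(parse_row(good))  # ['ACME Corp', '2023-04-01', '1,250.00']
print(parse_row(bad))   # None, so the row goes to manual review
```

Rejecting malformed rows and reviewing them by hand is one way a scraper ends up at a "better than 98% hit rate" rather than 100%.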
There are all kinds of ways to get into scraping and integrating data; it feels like we've done it all over the last few years. That one was, what, over a decade ago. There was a site out there for a while that had all the old tabletop documents: old-school Dungeons and Dragons, BattleTech, MechWarrior, all the old stuff that you can't get anymore because it's all discontinued. The guy had it all online, kind of like a Dropbox before Dropbox was really popular. I just wrote a little scraper that went out, found all the links, and downloaded all the documents into the same file structure. Then I was able to parse through it and find the ones I was looking for. I couldn't find them through the site itself, so I figured I'd just get them all and sort it out later. Those are things you can do with these scrapers. It's not just scraping data; you can download files, interact with things, pull in videos, pull in images. There are lots of things you can do with it.
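A sketch of that link-downloading scraper: find the document links on a page, then mirror the site's directory structure locally. The sample page, helper names, and file extension are invented for the demo; a live run would fetch each URL (noted in a comment) instead of only computing paths.

```python
import os
import re
from urllib.parse import urljoin, urlparse

def find_document_links(html_text, base_url, exts=(".pdf",)):
    """Pull every href out of the page and keep the document links."""
    hrefs = re.findall(r'href="([^"]+)"', html_text)
    urls = [urljoin(base_url, h) for h in hrefs]
    return [u for u in urls if u.lower().endswith(exts)]

def local_path(url, root="downloads"):
    """Mirror the site's path structure on disk, as in the anecdote."""
    path = urlparse(url).path.lstrip("/")
    return os.path.join(root, *path.split("/"))

page = '<a href="/docs/manual.pdf">manual</a> <a href="/about">about</a>'
urls = find_document_links(page, "https://example.com")
# To actually download each file you would do something like:
#   from urllib.request import urlretrieve
#   os.makedirs(os.path.dirname(local_path(u)), exist_ok=True)
#   urlretrieve(u, local_path(u))
print(urls)                 # ['https://example.com/docs/manual.pdf']
print(local_path(urls[0]))
```

Keeping the original path structure is what made it possible to "figure it out later": the local copy carries the same organization as the site.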
I actually did that too, and this is bonus material, so there you go. I had a customer I built a site for, and we had, I don't know, 20 or 30 different pages and screens. At the end of the day she wanted a screenshot of every single screen as part of the user documentation. I was like, oh my gosh, that's going to take a while. But what I ended up doing was a screen scraper that just walked through everything. It was easy enough because I knew all the links. I crawled the site and said: go to each page, snap a picture; go to the next page, snap a picture. I think there were three others where I went in manually, added those couple of links, and said go there and snap it. And suddenly I had the 40 or 50 images she needed.

Simple things like that. It's not until you think about it that you go, hey, that's actually a better way to do it than going through it manually. Once you've done it a couple of times, you look at some of these tasks and think: I really don't want to do that a hundred times, but maybe I can spend a few minutes, especially with some of these tools, and generate something I can replicate over and over and have it do the work for me instead of going through that drudgery.
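The screenshot-every-page trick can be sketched like this, assuming the third-party selenium package and a local Chrome driver. The filename scheme and helper names are invented; the WebDriver calls (`get`, `save_screenshot`) are the standard Selenium API.

```python
def snapshot_name(url, out_dir="screens"):
    """Derive a filesystem-safe filename for each page's screenshot."""
    import os
    import re
    slug = re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")
    return os.path.join(out_dir, f"{slug}.png")

def snapshot_pages(urls, out_dir="screens"):
    """Visit each URL and save a screenshot of the page.
    Requires the third-party 'selenium' package and a Chrome driver."""
    import os
    from selenium import webdriver
    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Chrome()
    try:
        for url in urls:
            driver.get(url)
            driver.save_screenshot(snapshot_name(url, out_dir))
    finally:
        driver.quit()

print(snapshot_name("https://example.com/pricing"))
```

Feed `snapshot_pages` the list of links from a crawl of the site, plus any pages you add by hand, and you get the whole set of documentation images in one run, which is exactly the workflow described above.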
Now, it's funny you mention that, because on the testing side of things, what you can do, especially with the Selenium WebDriver, is walk through a sitemap and take screenshots through WebDriver. If you want a screenshot in Chrome or Firefox, you essentially load the driver for that browser and then snapshot the page you're on. So if you're doing browser comparisons, to see how your site looks in each browser, you take those screenshots and then run an image compare, and it will tell you: hey, this image does not match. It's kind of cool and very simple, and there are good examples out there for that. That's just another way you can do it with some of these tools.
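The image-compare step can be sketched without any imaging library by comparing raw pixel sequences; with Pillow you would get them via `list(Image.open(path).getdata())`. The pixel values and the default threshold here are invented for the demo (the anecdote that follows uses an 80% rule of thumb).

```python
def match_ratio(pixels_a, pixels_b):
    """Fraction of pixel positions that agree between two same-sized
    screenshots, each given as a flat sequence of pixel values."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("screenshots must be the same size")
    same = sum(1 for a, b in zip(pixels_a, pixels_b) if a == b)
    return same / len(pixels_a)

def same_page(pixels_a, pixels_b, threshold=0.98):
    """Treat a high-enough pixel match as 'these render the same'."""
    return match_ratio(pixels_a, pixels_b) >= threshold

# Tiny demo: two 4-"pixel" images that differ in the last position.
ref = [0, 0, 0, 255]
new = [0, 0, 0, 200]
print(match_ratio(ref, new))  # 0.75
print(same_page(ref, new))    # False
```

A real comparison pipeline usually blurs or downsamples first so that anti-aliasing noise doesn't count as a mismatch; this sketch only shows the thresholding idea.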
That's a neat one. We did something that's a little bit, I'll say, what this customer was doing was a little bit shady. It's legal, but it's borderline stuff. They had an automation tool, essentially a scraper, that was very sensitive to certain themes and things like that. So they had a crawler go through and take a picture of each of the themes, and then later they could ask, "Am I on this page?" and compare against that whole image, as opposed to trying to look for other clues. They had been looking for certain tags, but those would change: things would move, the formats would change. The page looked basically the same visually, though, so they'd say: if this is, you know, an 80% match, we're basically on the right page; if it's not, we're on the wrong page.

They even did that with buttons. Instead of trying to find the button control, they would check whether that button's image existed somewhere in the mapping, use it to figure out the coordinates, and press the button that way. At first I was like, why do you have 18 pictures of the same button, all just a little bit different? It turned out it had to do with the way they were spinning stuff up: the display resolution, and some of the default color themes that would show up. It's funny, the kind of stuff you don't think about, like why you would ever need to take pictures as you're walking through a site, but there are a lot of uses for such things.
Well, it's even more interesting now. The last thing you mentioned there, they were taking the pictures to see what was on them. Now Amazon has, it's not Polly, but one of their libraries, where you can look for text in an image and then scrape the text out of the image as well. So not only can you scrape web pages, you can also scrape the images on a page.

That's true. That's one I've seen some of in recent years. I don't know how well it works, but I've seen a couple of places that use it, and it seems to work pretty well for them. (Polly, I think, is the text-to-speech one; it's something like that.) I was amazed at some of the image processing stuff. This was, gosh, three or four years ago, when I was going through all the Amazon services; they had just opened a couple of those up. It was built initially for augmented and virtual reality, but you could take a series of pictures and say, "Show me the parrots," and it would go find the parrots in each of them. Not match against a picture of a parrot, just "show me parrots," and it would pull those out. It's amazing, the image processing they can do and how well they can search images now. That will probably end up being the next round of scraping: hey, I want to grab some of the text off of these images.
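The hosts don't name the Amazon service, so this is an assumption: Amazon Rekognition's `detect_text` API is one service that extracts text from images, and the sketch below uses its documented response shape. The offline demo works on a hand-built sample response; the live call needs the third-party boto3 package and AWS credentials.

```python
def lines_from_detect_text(response, min_confidence=90.0):
    """Filter a Rekognition detect_text response down to the detected
    LINE strings above a confidence threshold."""
    return [
        d["DetectedText"]
        for d in response.get("TextDetections", [])
        if d["Type"] == "LINE" and d["Confidence"] >= min_confidence
    ]

def detect_text_live(image_bytes):
    """The live call (sketch only; the episode doesn't name the service).
    Requires the third-party 'boto3' package and AWS credentials."""
    import boto3
    client = boto3.client("rekognition")
    return client.detect_text(Image={"Bytes": image_bytes})

# Offline demo with the response shape detect_text returns:
sample = {"TextDetections": [
    {"DetectedText": "SALE 50% OFF", "Type": "LINE", "Confidence": 99.1},
    {"DetectedText": "SALE", "Type": "WORD", "Confidence": 99.3},
    {"DetectedText": "smudge", "Type": "LINE", "Confidence": 41.0},
]}
print(lines_from_detect_text(sample))  # ['SALE 50% OFF']
```

Chaining this after a screenshot crawler is the "next round of scraping" the hosts describe: scrape the page, then scrape the text out of the images on it.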
Apple's got something too, and I think Google does as well. I know we're getting a little long here, but you can also translate text. I've actually been watching some shows and gone, "Oh, what does that say?" Then I take a picture, highlight the text, and say translate, and it'll translate it. It works for most languages. It's amazing where we're going with technology.

I got introduced to that, gosh, probably 10 or 15 years ago, because Google Translate was something you could slap on any web page. There was something I was doing for a customer, I don't remember who, and they needed the site in like six different languages. I don't know six different languages; programming languages, great, but spoken languages, no. They were talking about going and hiring some people: extract everything out into strings, then convert and translate all of it. I ended up playing around with Google Translate instead. You just put a little button there, tell it what language, and boom, it gives you that language. It was two lines of code. It wasn't perfect, because it's just a brute-force translation, but it was close enough for the languages they were using. They're like, "Yeah, that works; that's what we need." And we said, okay, if there's a language you really want and it doesn't translate well, we can always go back and create that specific page for that language. Stuff like that gets you the 80/20 rule: if it can get you 80% of the way there or better, why not? Some of that stuff is almost free to use, or actually is free, so why not take advantage of it?
Exactly. All right, I think we can call it a wrap. We got a couple of episodes done; we'll come back next week, we'll do it on Thursday, and we'll just keep chugging along.

Sounds good. Thanks again, Rob, for moving around and dealing with the weather. We're supposed to get the really bad storms tomorrow, so I didn't want to take a chance on losing internet during that.

Good point. Yeah, it's sort of frustrating when that happens while we're recording. So to the rest of you: goodbye, have a good one, and we'll talk to you again next time. I'm sure we'll have plenty to rant about next week as well.

[Music]