Amelia Winstead | Blogcrawlers

All things programmatically creepy crawly

I'm going to get a bit technical here for a moment so bear with me.

In case you didn't know the difference, a crawler is a bit of code that will connect to a website and pull its data off the page. Sometimes it will pull everything, sometimes it will look for specific pieces of data. Sometimes it will even log in and authenticate with provided credentials. A spider, by contrast, will both crawl pages and store the data it pulls, usually in a database but sometimes in a file. While neither one is illegal, you will want to be careful if using a spider and delivering the content back somewhere else, since certain pieces of content (images, articles, etc.) may be copyrighted depending on their source. Or you can be reckless and wait for a cease and desist to show up, but I find that distasteful, because while the actual crawling and storage of data is not illegal, serving up someone else's copyrighted content is.

Spiders go by a lot of names, including scraping and data-mining. I prefer those names since in-real-life spiders creep me the hell out. So for the rest of this article, I'll be referring to them as data-mining or crawlers, but please understand that I am being non-technical and non-literal and intentionally using loose terminology to avoid using a word I don't like. Deal with it.

So what might be a use case for a crawler? If you've ever visited the Wayback Machine, you already have some idea what they're capable of. You can use a crawler to pull an entire page's worth of data and then redeliver it, or more commonly, you can use it to pull specific data from a page and then format, rerender or build functions out of it. I've built a few, and some of the use cases involved getting election results ASAP from an unwieldy government or AP website and making them palatable for customers, building myself a little smart search and notify script for finding the perfect dog at the perfect price off multiple rescue and adoption websites (spoiler alert, it worked!), or building an application that sifted through huge amounts of text-based data (road conditions) and regex matched for specific fragments where the actual road quality, traffic accidents/delays and construction were. These are just a few potential uses for crawlers I've written; I am sure there are thousands.

But, much like anything in code, a crawler is only as good as its author. A lot of people hate regex, so they refuse to use it. I think this is a tad short-sighted, since regex can do a lot of powerful things when you combine it with a crawler. A lot of people also hate working with things like curl, in favor of prebuilt ajax niceties. I have no beef with this, since they can both achieve nearly the same results, but sometimes curl is a lot easier to handle than ajax, specifically when baking authentication into your crawler.

I'm going to discuss my dog-finder crawler in detail, since it's probably the most basic one I built of the ones I listed, and it's a great example of how a crawler can save you literal hours of time every day. This was a very specific use case, but you'll be able to see pretty easily how it could be extrapolated out for other functions.

Anyone who has a dog knows several things: dogs are amazing, far better pets than cats, they can be quite expensive if you get them from the wrong place or person, and they can also cause a great deal of trouble if you get one that isn't a good fit for your household. For example, I'm a bit messy (not a slob, but I typically only clean house once or twice a week) which means that any breed tall enough to grab things off the kitchen countertop would not survive long in my house because of my addiction to chocolate. I wanted to get a female dog, because until they're fixed, male dogs are more difficult to potty train since they want to mark everywhere, and I wanted to get a dog that was an adorable breed I liked (or some mix thereof). I don't think dogs with squished snouts are very cute, and I don't like bully-breeds. I also wanted to get a puppy or young dog, so that it could be with me for a long time and I could have an easier time training it (it was going to be my first dog on my own!). I also wanted to spend less than $500 on the purchase of this dog, since my job at the time gave me a pitiful (and shamefully female-biased) salary. I also very much wanted a rescue, which is hard to find with puppies. I also did not want a teacup/toy dog, since they are prone to health issues.

So right there, we have quite a few criteria that can be boiled down to search input: under 1 year old, female, specific breeds only, small to medium size but not a toy, and under $500. If you've ever tried to search the internet for a dog, you know there are dozens of dog rescue, sale and adoption websites. Craigslist, Petfinder, ASPCA, Petango and on and on. And the difficult part is that most, if not all of them, have different dogs listed on them daily. You have to manually visit every website, put in your search criteria (assuming they have a search filtering function, which not all of them do), spend an hour or two on each site combing through their paginated or lazy-loading list of dogs, and if you are lucky enough to find one that might be a good fit for you, it might already be gone by the time you find it. Spending 4-5 hours a day combing through these sites wasn't working for me.
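To make "boiled down to search input" concrete, here is a minimal sketch of those criteria as a single predicate. It is written in Python for illustration, not the PHP the original ran on. The `Listing` fields, the size labels and the breed keywords are all placeholders I made up for the example; only the thresholds (under 1 year, under $500, female, small to medium) come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """One dog listing, normalized across sites (hypothetical field names)."""
    name: str
    sex: str                 # "female" or "male"
    age_months: int
    breed: str
    size: str                # "toy", "small", "medium" or "large"
    price: Optional[float]   # None when the site shows no price


# Placeholder breed keywords; the real list is a matter of personal taste.
ALLOWED_BREED_WORDS = ("terrier", "spaniel", "collie")
ALLOWED_SIZES = {"small", "medium"}   # small to medium, but not a toy
MAX_AGE_MONTHS = 12                   # under 1 year old
MAX_PRICE = 500.0                     # under $500


def matches(listing: Listing) -> bool:
    """Return True only if the listing passes every criterion."""
    if listing.sex.lower() != "female":
        return False
    if listing.age_months >= MAX_AGE_MONTHS:
        return False
    if listing.size.lower() not in ALLOWED_SIZES:
        return False
    breed = listing.breed.lower()
    if not any(word in breed for word in ALLOWED_BREED_WORDS):
        return False
    # Keep listings without a price so a human can still look at them.
    if listing.price is not None and listing.price >= MAX_PRICE:
        return False
    return True
```

Calling `matches(Listing("Example", "female", 6, "Terrier mix", "small", 200.0))` returns `True`, while the same listing with `sex="male"` or `size="toy"` is rejected. A strict breed allowlist like this one would skip a dog listed under the wrong breed, so loosening that check may be worth it in practice.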

Enter crawler. I set it up first to make a curl request to the URL for each of those sites. Some of them had filtering functions that worked off $_GET parameters, which was A+ awesome for my crawler. Other ones had $_POST filter requests, also pretty simple, and still others required you to log in or provide an email before you could search. I set it up so that whichever site it was, it got the filters as correct as possible and the authentication it needed to access its search engines. Next up, I needed to regex comb through the search returns since not every site had a functional search engine. Believe it or not, a lot of the searches will return male dogs for female dog searches, or breeds you aren't looking for. Some even ignored pricing filters. The life of a programmer is hard sometimes, usually when it involves other (typically third-party) programmers. So I built specific regex for each site's returned content that would eliminate any listings that didn't look like what I was after. I also added in a few headers to denote which site each listing was coming from so I could find the dogs quickly (if no active links were provided on the actual listings). Once I had some semi-coherent HTML that only showed me the dogs I was interested in, the next step was getting it set up to notify me when a new one showed up.
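The two kinds of filter request and the regex pass can be sketched roughly like this. This is Python with the standard library rather than the PHP and curl I actually used, and the URL, the parameter names and the `<div class="listing">` markup are invented for the example. Every real site needed its own patterns.

```python
import re
from typing import List
from urllib.parse import urlencode
from urllib.request import Request


def build_get_request(base_url: str, filters: dict) -> Request:
    """Filters go in the query string (what a PHP site reads from $_GET)."""
    return Request(f"{base_url}?{urlencode(filters)}", method="GET")


def build_post_request(base_url: str, filters: dict) -> Request:
    """Filters go in a form-encoded body (what a PHP site reads from $_POST)."""
    body = urlencode(filters).encode("utf-8")
    return Request(
        base_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical markup: each listing is assumed to sit in its own flat div.
LISTING_RE = re.compile(r'<div class="listing">(.*?)</div>', re.DOTALL)
FEMALE_RE = re.compile(r"\bfemale\b", re.IGNORECASE)
PRICE_RE = re.compile(r"\$\s*([\d,]+(?:\.\d+)?)")


def filter_listings(html: str, site_name: str, max_price: float = 500.0) -> List[str]:
    """Drop listings the site's own search should have excluded.

    Each kept listing gets a header naming the site it came from.
    """
    kept = []
    for match in LISTING_RE.finditer(html):
        block = match.group(1)
        if not FEMALE_RE.search(block):
            continue  # male dog, or sex not stated
        price = PRICE_RE.search(block)
        if price and float(price.group(1).replace(",", "")) >= max_price:
            continue  # the site ignored the price filter
        kept.append(f"<h3>{site_name}</h3>\n{block.strip()}")
    return kept
```

Passing either request object to `urllib.request.urlopen` would perform the fetch. The sketch stops short of that so it can run without touching a real site. Sites that require a login would need a cookie-aware opener on top of this, which is not shown here.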

The email part was easy: just dump the HTML into a PHP mail function and bam, you're getting an hourly email of all the dogs in your pool. The detection for new dogs in comparison to the previous crawl was a bit trickier. It involved storing the HTML in a private database (side note, DO NOT do this unless your project is completely private, as HTML requires characters that are an injection risk) and pulling that data for comparison every time a site was crawled. I was super lazy with this bit of the code, and it basically just did a length check to see if the string length was larger than it had been on the previous crawl. If it wasn't, don't email me. If it was, then yay! A new dog I might love had been added, so get that email out ASAP.
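Here is a rough sketch of that lazy length check, again in Python rather than the original PHP. The table name, column names and email subject are made up for the example. It uses SQLite with parameterized queries (the `?` placeholders), which is the usual way to keep stored HTML from being read as SQL. That is a precaution I'm adding in the sketch, not a description of what my original code did.

```python
import sqlite3
from email.message import EmailMessage


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the snapshot store; table and columns are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots "
        "(site TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def has_grown(conn: sqlite3.Connection, site: str, new_html: str) -> bool:
    """Save new_html and report whether it is longer than the last snapshot."""
    row = conn.execute(
        "SELECT html FROM snapshots WHERE site = ?", (site,)
    ).fetchone()
    previous_length = len(row[0]) if row else 0
    # Placeholders keep quotes and semicolons in the HTML out of the SQL itself.
    conn.execute(
        "INSERT OR REPLACE INTO snapshots (site, html) VALUES (?, ?)",
        (site, new_html),
    )
    conn.commit()
    return len(new_html) > previous_length


def build_email(sender: str, recipient: str, html: str) -> EmailMessage:
    """Wrap the filtered HTML in a message with a plain-text fallback."""
    msg = EmailMessage()
    msg["Subject"] = "New dog listings"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("This message needs an HTML-capable mail client.")
    msg.add_alternative(html, subtype="html")
    return msg
```

The message could then be handed to `smtplib.SMTP(...).send_message(msg)` against whatever mail server is available. Keep in mind that a length check stays quiet if one dog is adopted and a new one with a shorter listing appears in the same hour, so comparing individual listings would be more reliable if you can spare the effort.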

Most of the stuff I built for this was running on WAMP, but I set up a command line script that I could activate when I left for work each morning so that if my perfect dog was available, I'd be able to take my 15-minute break and make the call to get her before anyone else could. It's a dog-eat-dog world. Especially when it comes to cute rescue puppies. For the record, I believe all dogs deserve to have a happy home, and that there is a perfect dog for everyone. That said, there is a spectrum of dogs that you can cohabitate with and a spectrum of dogs you wouldn't want to. Some people don't like dogs that are too small since they get under foot, some people don't like dogs that are too hyper because they just want a lap dog, some people are allergic to hairier breeds. Some people are indiscriminate with their love of dogs.
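A start-it-and-leave command line script can be as small as a loop with a sleep in it. This is a Python sketch under my own assumptions, not the script I actually ran. The `--interval` and `--max-runs` flags are invented for the example, and `crawl_once` stands in for the fetch, filter and notify steps.

```python
import argparse
import time
from typing import Callable, List, Optional


def run_crawls(
    crawl_once: Callable[[], None],
    interval_seconds: float,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call crawl_once repeatedly, pausing between runs.

    With max_runs=None the loop only ends when the process is stopped.
    Returns the number of crawls performed.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        crawl_once()
        runs += 1
        # Skip the final pause when a run limit has been reached.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


def main(argv: Optional[List[str]] = None) -> int:
    """Parse the (hypothetical) flags and start the loop."""
    parser = argparse.ArgumentParser(description="Run the dog crawler on a timer.")
    parser.add_argument("--interval", type=float, default=3600.0,
                        help="seconds between crawls (default: one hour)")
    parser.add_argument("--max-runs", type=int, default=None,
                        help="stop after this many crawls (default: never)")
    args = parser.parse_args(argv)
    # Placeholder crawl step; the real one would fetch, filter and email.
    return run_crawls(lambda: print("crawling..."), args.interval, args.max_runs)
```

Calling `main(["--interval", "3600"])` from a small launcher file starts an hourly loop that runs until you stop it. The `sleep` parameter is only there so the loop can be exercised without actually waiting.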

I am pretty discerning with the dogs I bring into my household, mostly because I've owned a lot of different breeds and have developed some preferences. The dog I ended up getting was a terrier mutt mix (Aussie shepherd? Dachshund?) named Lulu for $200, who was listed under the wrong breed (Craigslist had her down as a pomchi). She is still fearful of most new humans (and larger dogs) because of her abuse as a puppy, but she is practically a genius on the dog-intelligence scale and she is a delight to be around once she knows you. I've had her for about 2 years now, and I love her to pieces. I got her trained in under 2 months (all basic commands and potty), and she is as happy as a clam to be able to chase the squirrels in my backyard or sleep all day on my couch. The moral of this story is that without the crawler, I might not have known she existed in time to get her, and my home would definitely be poorer for that.

I was also able to use this crawler to get my second dog, a little puffball named Truffles. I'll likely use it to find my 3rd, 4th, and so on dogs for the rest of my life, since it seems quite effective. It has likely saved me dozens of hours in search-effort, and for my time, that's a hefty price to pay. I'd have gladly paid it for these two, but I'm glad that I didn't have to.