You Need to Stop Scraping Open Data, Now!
In this guest post, Daniel Cave, a web scraping insider formerly of two of the world's most prominent scraping platforms, import.io and Diffbot, covers why you might scrape open data, and why these days it's not necessarily the best option for most companies.
Looking back to the deep, dark days of the early web, before the advent of open APIs, Schema.org and RDF, and well before artificial intelligence, visual web scraping was one of the very few ways to get structured data out of the web. It was (and in many cases still is) a hard and thankless task, requiring large teams of developers and infrastructure to crawl, parse and extract data from URLs.
Fast forward to today: because data-driven practices have been widely adopted and data can drive positive business outcomes, there is more demand for data than ever… and with that, there are now a number of easy-to-use libraries, tools and services that make web scraping comparatively simple. So what's the issue?
I'd like to draw a simple analogy of a chef baking a cake. It doesn't make sense for the chef to churn their own butter or refine their own sugar; it's not central to their business, and it doesn't add anything to the creation of the final product. That being the case, why would they commit to the overhead rather than just buy the ingredients from a supplier?
Specialist, dedicated suppliers will deliver consistently higher quality products, while assuming all the overheads, risks and issues associated with the creation of their product. It's the foundation of the supply chains that drive the global economy, and the same now applies to open data.
With the arrival of specialist open data suppliers, it doesn't make sense for companies to invest in building and maintaining a scraping infrastructure; it's pure overhead, and a risk you don't need and shouldn't be taking. That alone should be enough to convince any business to move away from scraping open data to using a supplier, but there is so much more to it.
Here are some other reasons you need to stop scraping open data:
- You’ll struggle to keep the data fresh with so many sources
- Matching algorithms are time-consuming and complex
- Integrating and federating scraped data is hard
- With a supplier, you can work to fixed budgets and deadlines
- Suppliers can invest in innovation where you can’t
- Data laws like GDPR are becoming more complex and harder to abide by
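To give a feel for the matching point in the list above, here is a minimal Python sketch of matching company names scraped from two sources. Everything in it is an illustrative assumption rather than a method from any particular supplier: the `SUFFIXES` set, the function names and the 0.85 threshold are all hypothetical, and real matching would also need addresses, registration numbers and a lot of edge-case handling.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-form suffixes to ignore when comparing names.
SUFFIXES = {"ltd", "limited", "plc", "llp", "inc"}


def normalise(name: str) -> str:
    """Lowercase, strip punctuation and drop common legal-form suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Similarity between two names, from 0.0 to 1.0, after normalising."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return the closest candidate, or None if nothing clears the threshold.

    The threshold is an assumed value; in practice it has to be tuned
    against hand-checked examples for each pair of sources.
    """
    if not candidates:
        return None
    score, candidate = max((similarity(name, c), c) for c in candidates)
    return candidate if score >= threshold else None
```

Even this toy version has to make judgment calls about suffixes, punctuation and thresholds, and it compares every name against every candidate, which quickly becomes slow across many sources.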
So how and where do you get open data from, then?
All the large data incumbents already use open data; however, they won't give you access to the raw files, as they make their money by adding value on top with customer segments and company credit scores. If you think of open data as a commodity, then some of the important things to consider are:
- How easy is that data to access, use and integrate?
- Has the data been cleaned and is it business ready?
- Is your supplier GDPR compliant?
- How often is their data refreshed?
- Is the data matched correctly?
- Do they supply the data opinion-free? (i.e. as observed)
Everybody knows the old saying: "80% of a data scientist's time is spent cleaning data". Everybody also knows data science is not cheap… so you need to be smart when choosing data suppliers. Make sure their data is clean and packaged well, and that it's available in a single queryable and federated platform, so that your team is not wasting time building unnecessary data structures.
For all the above reasons, I don't hesitate in recommending Doorda as a replacement for web scraping. Whilst scraping can be good for tracking competitors or checking prices, gathering open data isn't one of its good uses.