Opinion

You Need to Stop Scraping Open Data, Now!

Editor

In this guest post, Daniel Cave, a web scraping insider formerly of two of the world’s most prominent scraping platforms, import.io and Diffbot, covers why you might scrape open data, and why these days it’s not necessarily the best option for most companies.

Looking back to the deep, dark days of the early web, before the advent of open APIs, Schema.org and RDF, and well before artificial intelligence, visual web scraping was one of the very few ways to get structured data out of the web. It was (and in many cases still is) a hard and thankless task, requiring large teams of developers and a lot of infrastructure to crawl, parse and extract data from URLs.

Fast forward to today. Thanks to the adoption of data-driven practices and the ability to use data to drive positive business outcomes, there is more demand for data than ever… and with that there are now a number of easy-to-use libraries, tools and services that make web scraping comparatively simple. So what’s the issue?

I’d like to draw a simple analogy of a chef baking a cake. It doesn’t make sense for the chef to churn their own butter or refine their own sugar; it’s not central to their business, and it doesn’t add anything to the creation of the final product… that being the case, why would they commit to the overhead rather than just buy the ingredients from a supplier?

Specialist, dedicated suppliers will deliver consistently higher quality products, while assuming all the overheads, risks and issues associated with the creation of their product. It’s the foundation of the supply chains that drive the global economy, and the same now applies to open data.

With the arrival of specialist open data suppliers, it doesn’t make sense for companies to invest in building and maintaining a scraping infrastructure; it’s pure overhead and a risk you don’t need and shouldn’t be taking. That alone should be enough to convince any business to move away from scraping open data to using a supplier, but there is so much more to it.

Here are some other reasons you need to stop scraping open data:

  • You’ll struggle to keep the data fresh with so many sources
  • Matching algorithms are time-consuming and complex
  • Integrating and federating scraped data is hard
  • With a supplier, you can work to fixed budgets and deadlines
  • Suppliers can invest in innovation where you can’t
  • Data laws like GDPR are becoming more complex and harder to abide by
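To give a feel for the matching point above, here is a minimal sketch of matching company names from two scraped sources. It uses only Python's standard library; the suffix list, the 0.85 threshold and the function names are illustrative assumptions, not anything a particular supplier uses.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal-form suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp"}


def normalise(name: str) -> str:
    """Lower-case, strip punctuation and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return the closest candidate at or above the threshold, else None."""
    scored = [(similarity(name, c), c) for c in candidates]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None
```

Even this toy version has to make choices about suffixes, punctuation and thresholds, and it compares every name against every candidate. Real sources add trading names, re-registrations and address variations on top, which is where the time goes.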

So where do you get open data from, then?

All the large data incumbents already use open data; however, they won’t give you access to the raw files, as they make their money by adding value on top with customer segments and company credit scores. If you think of open data as a commodity, then some of the important things to consider are:

  • How easy is that data to access, use and integrate?
  • Has the data been cleaned and is it business ready?
  • Is your supplier GDPR compliant?
  • How often is their data refreshed?
  • Is the data matched correctly?
  • Do they supply the data opinion-free? (i.e. as observed)

Everybody knows the old saying: “80% of a data scientist’s time is spent cleaning data”. Everybody also knows data science is not cheap… so you need to be smart when choosing data suppliers. Make sure their data is clean and packaged well, and that it’s available in a single queryable and federated platform, so that your team is not wasting time building unnecessary data structures.

For all the above reasons I don’t hesitate to recommend Doorda as a replacement for web scraping. Whilst scraping can be good for tracking competitors or checking prices, gathering open data isn’t one of the things it’s good for.
