You Can Always Extract Something From Scraping

by Alvaro Febrel

More often than not, big companies ignore startups that come to them for help. To make things worse, a startup’s business model sometimes relies on those established players to succeed. “Data aggregators” are a good example, as they use other companies’ data to provide new services for customers (Tripadvisor, Kayak, Cake Financial and Chegg all aggregate data from different sources in one way or another). The issue is, what can a data aggregator do if its sources of information don’t collaborate? The answer: scrape!

The term “screen scraping” describes software that reads and extracts information from data that was intended for display to an end user, as opposed to reading the same information from “non-manipulated/machine-oriented” data. Screen scraping is usually considered an inefficient and fragile way to get information, because it depends on how the end-user output is displayed. For instance, if you want to know today’s oil price, you could a) connect to a broker’s database and extract the value from there, or b) create a program that logs into the broker’s website, looks up the commodity, and reads the number shown in a graph, for example. If the broker changes the layout of its graph or website, we would have to tweak the program to extract the correct value.
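To make option b) concrete, here is a minimal sketch in Python of why scraping breaks when a layout changes. Everything in it is assumed for illustration: the two HTML snippets, the "price" class name, and the price itself are hypothetical, and a real scraper would fetch the page over the network and handle login, which is left out here.

```python
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Reads the text of the first <span class="price"> element.

    The tag and class name are assumptions about a hypothetical page,
    not any real broker's markup.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # Only start capturing if we have not found a price yet.
        if tag == "span" and dict(attrs).get("class") == "price" and self.price is None:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            # Strip whitespace and the currency symbol, then parse the number.
            self.price = float(data.strip().lstrip("$"))
            self._in_price = False


def scrape_price(html):
    """Return the displayed price as a float, or None if it was not found."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.price


# Hypothetical page before and after the site changes its layout.
OLD_LAYOUT = '<div id="quote"><span class="price">$81.25</span></div>'
NEW_LAYOUT = '<div id="quote"><b class="last-trade">$81.25</b></div>'

print(scrape_price(OLD_LAYOUT))  # 81.25
print(scrape_price(NEW_LAYOUT))  # None: same data, but the scraper no longer finds it
```

The price is identical in both snippets; only the markup around it changed. That is the maintenance cost described above: each layout change on each scraped site means another patch to the scraper.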

Despite being a fragile solution, I believe data scraping can be very useful and could be a lean way to go when:
  1. The market has low fragmentation, i.e. most of it is concentrated among a few players - By the 80-20 rule, scraping a handful of sources would give us a fast foothold in the market.
  2. The industry is stable and not evolving too fast - Otherwise, the likelihood of having to constantly patch your code would increase.
  3. There is little multi-homing - Customers who use multiple sources of information would expect all of them to be aggregated in one place, which increases the number of sites we would need to scrape and makes scraping inefficient.
  4. The market has network effects that would fuel virality - Scraping would lower customer acquisition costs and solve the classic chicken-and-egg problem that many platforms face by cheaply “acquiring” one side of the platform. Once the other side of the platform buys in, the company could “pivot” and would have the bargaining power to negotiate a more seamless data supply.

I tried to test these hypotheses through the lens of two companies that used scraping: Cake Financial (a website for consumers seeking to improve their investment portfolio performance) and Chegg (a website for student book rentals).

Low Fragmentation, High Concentration?
  Cake Financial: My guess is that the market was fairly fragmented, with no online broker having a particularly big market share.
  Chegg: My hypothesis is that by 2010, with only a few online book retailers such as Amazon, you could provide 95% of the books in the market.

Is the Industry Stable?
  Cake Financial: Yes. I would say there has been little innovation in online brokerage in recent years.
  Chegg: Yes. By 2008-2009, I believe online bookstores did little experimentation.

Is There Multi-Homing?
  Cake Financial: Yes. Steve Carpenter, Cake founder and CEO, mentioned that individual investors had many accounts and were asking to have all of them linked to Cake Financial.
  Chegg: No. There are many online retailers that customers may use to buy their books, but this cannot be considered multi-homing.

Are There Network Effects?
  Cake Financial: Yes (in theory), although there was no virality, as top performers had no interest in sharing their portfolio strategies.
  Chegg: Yes. There are the indirect network effects of a two-sided platform. Besides, both word of mouth and the “plant a tree” initiative propelled virality.

For Cake Financial, screen scraping turned out to be the wrong decision: it consumed a lot of the company’s resources (multi-homing increased the number of sites that had to be scraped) and the payoff was small given the lack of virality. For Chegg, screen scraping was a tool that allowed the company to grow quickly and gain market power.

To summarize, I would say that data scraping may not be the best way to run a company in the long term, but it can definitely be a good intermediate solution that propels the growth of a startup under certain conditions.