Is the influx of data science “lemons” hurting data science salaries?
There was a recent quarrel in the data science community on LinkedIn regarding MOOCs (Massive Open Online Courses) and whether aspiring data scientists should put these courses on their resumes. I don’t have a strong opinion on whether you should or shouldn’t do this, but the controversy did make me think about the sheer number of master’s programs, MOOCs, and online resources you can choose from to learn data science.
For better or worse, these resources are churning out “data scientists” at an incredible rate and it seems like everyone is a data scientist these days. This influx of data science applicants with credible and not-so-credible credentials coupled with the flawed data science hiring process has made for an interesting and polarizing job market–and has reminded me of a famous economic thesis: The Market for Lemons.
The economist George Akerlof’s 1970 paper, “The Market for Lemons: Quality Uncertainty and the Market Mechanism,” illustrates the impact of information asymmetry and quality heterogeneity in a market.
In his paper, Akerlof discusses the problem of quality uncertainty in the context of the used car market. I will discuss this same problem of information asymmetry but in the context of the data science job market–and whether this might be hurting the salaries of highly-skilled data scientists.
There are two types of “agents” involved in the data science and analytics job search: the hiring companies and the applicants. Both have the utility function U(x,y,t,q) = tqx + y, where x represents work, y represents money, q is work quality, and t is a parameter that captures how highly an agent values work relative to money.
The applicants are endowed with x = 1 (they could theoretically work for themselves) and companies are endowed with x = 0 (since the position is unfilled). Both companies and applicants are also endowed with money (the y-good), and there are more companies hiring than there are applicants applying, such that n_c > n_a, where n_c is the number of companies and n_a is the number of applicants.
The parameter t is what makes a data science job market possible. Applicants could go into self-employment and companies could leave the position unfilled. However, applicants value their work at t = 1, while a company values that same work more highly, at t = (3/2), so both sides can gain from a hire.
Now, suppose that one-half of the applicants are less qualified and do not have the typical skills of a highly qualified data scientist. They have q = $50,000. The other half of the applicants are qualified and they do possess the skills necessary to be considered a highly qualified data scientist. They have q = $100,000.
You can think of “t” and “q” like this: the lowest offer a less qualified applicant will accept is $50,000 (t * q = 1 * 50,000). A company understands that negotiation is part of the hiring process and is willing to negotiate up to $75,000 (t * q = (3/2) * 50,000) with a less qualified candidate. On the other hand, the lowest offer a highly qualified applicant will accept is $100,000 (t * q = 1 * 100,000), and a company is willing to negotiate up to $150,000 (t * q = (3/2) * 100,000) with such a candidate. However, these scenarios only hold under perfect information, meaning hiring companies can tell which candidates are highly qualified and which are less qualified.
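The perfect-information arithmetic above can be reproduced in a few lines. This is only an illustrative sketch: the constant and function names are made up for this example, and the numbers are the ones assumed in the text.

```python
# Illustrative sketch of the perfect-information case.
# All names here are hypothetical; the values come from the text above.
T_APPLICANT = 1.0   # applicants' valuation parameter t
T_COMPANY = 1.5     # companies' valuation parameter t = 3/2

QUALITY = {"lemon": 50_000, "plum": 100_000}  # work quality q per applicant type


def negotiation_range(q):
    """Return (lowest offer the applicant accepts, highest offer a company makes)."""
    return (T_APPLICANT * q, T_COMPANY * q)


ranges = {kind: negotiation_range(q) for kind, q in QUALITY.items()}

for kind, (low, high) in ranges.items():
    print(f"{kind}: ${low:,.0f} to ${high:,.0f}")
```

Any salary inside a type's range leaves both sides better off than walking away, which is why a deal is possible for lemons and plums alike when companies can tell them apart.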
Now consider this scenario: applicants know their skill level but hiring companies do not. The assumption here is that no matter what hiring companies do to try to detect “fake data scientists,” they will be unsuccessful. I believe the newness of the industry and the confusion about what a data scientist even is make this a defensible assumption. Hiring companies only know that the market consists of one-half lemons (less qualified applicants) and one-half plums (qualified applicants).
The figure below depicts the supply and demand curves in this market:
Let’s first consider the supply curve for the applicants. Between $0 and $50,000, no applicant (lemon or otherwise) would be willing to accept an offer, because even a lemon values their own work at t * q = 1 * 50,000 = $50,000. Because applicants know their own skills, data science lemons are only willing to accept offers of $50,000 or higher.
Thus, between $50,000 and $100,000, only half of the applicants are available. Specifically, all lemons are available to companies. At offers of $100,000 or above, data science plums are willing to accept. Thus, from $100,000 onward, all applicants are willing to accept an offer.
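The supply side described above is a simple step function. Here is a minimal sketch of it, assuming the 50/50 split and the reservation salaries from the text; the function and constant names are hypothetical.

```python
# Sketch of the applicant supply curve as a step function.
# Names are hypothetical; thresholds and shares come from the text.
LEMON_RESERVATION = 50_000    # lowest offer a lemon accepts (t * q with t = 1)
PLUM_RESERVATION = 100_000    # lowest offer a plum accepts
LEMON_SHARE = 0.5             # half of all applicants are lemons


def share_willing_to_accept(offer):
    """Fraction of all applicants who would accept the given salary offer."""
    share = 0.0
    if offer >= LEMON_RESERVATION:
        share += LEMON_SHARE
    if offer >= PLUM_RESERVATION:
        share += 1.0 - LEMON_SHARE
    return share


for offer in (40_000, 50_000, 99_999, 100_000):
    print(f"${offer:,}: {share_willing_to_accept(offer):.0%} of applicants")
```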
Now, consider the demand curve for the hiring companies. Although the companies do not know which candidates are qualified and which are not, they know the fraction of lemons and plums on the data science job market, and they are aware that applicants know their own skill level. Therefore, at offers below $100,000, companies are willing to negotiate only up to $75,000, a lemon’s maximum value to them (t * q = (3/2) * 50,000 = 75,000). Companies know that no data science plums will accept such an offer, because plums are not willing to accept anything below $100,000.
Because companies have t = (3/2) and are unaware of individual applicant quality, they will also be willing to offer applicants between $100,000 and $112,500. This is because at a salary of $100,000 or more, companies know that both data science lemons and plums would accept the offer, so the expected applicant quality is (50,000 + 100,000) / 2 = 75,000. Hiring companies will therefore negotiate up to (3/2) * 75,000 = $112,500.
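The company’s reasoning can be sketched as a small calculation: work out the average quality of everyone who would accept a given offer, then check whether the offer is at or below (3/2) times that average. This is an illustration under the assumptions stated in the text, with hypothetical names, not a general model of hiring.

```python
# Sketch of the demand side under quality uncertainty.
# Names are hypothetical; the numbers are the ones assumed in the text.
T_COMPANY = 1.5  # companies' valuation parameter t = 3/2

# (quality q, share of the applicant pool). Because applicants have t = 1,
# each applicant's lowest acceptable offer equals their q.
APPLICANTS = [(50_000, 0.5), (100_000, 0.5)]


def expected_quality(offer):
    """Average q among applicants who would accept this offer (None if nobody would)."""
    accepting = [(q, share) for q, share in APPLICANTS if offer >= q]
    total_share = sum(share for _, share in accepting)
    if total_share == 0:
        return None
    return sum(q * share for q, share in accepting) / total_share


def company_will_offer(offer):
    """True if the offer does not exceed the company's value of the expected hire."""
    avg_q = expected_quality(offer)
    return avg_q is not None and offer <= T_COMPANY * avg_q


for offer in (75_000, 90_000, 100_000, 112_500, 120_000):
    print(f"${offer:,}: company willing = {company_will_offer(offer)}")
```

Under these assumptions the check also fails for offers between $75,000 and $100,000: only lemons would accept in that range, and a lemon is worth at most $75,000 to the company.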
What we end up seeing is an impact on the salaries of highly qualified data science plums. When companies were able to know for certain whether an applicant was qualified, data science plums were able to negotiate up to $150,000. Now, with this quality uncertainty problem, companies are only willing to negotiate up to $112,500.
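To put a number on that gap, here is a short sketch comparing the two negotiation ceilings for a plum. The variable names are hypothetical, and the figures are simply those of the stylized model above, not real salary data.

```python
# Sketch: how much negotiating room a plum loses under quality uncertainty.
# Names are hypothetical; values are the stylized ones from the text.
T_COMPANY = 1.5
Q_LEMON, Q_PLUM = 50_000, 100_000
LEMON_SHARE = 0.5

# Ceiling when companies can identify a plum with certainty.
ceiling_known = T_COMPANY * Q_PLUM

# Ceiling when companies only know the mix of lemons and plums.
expected_q = LEMON_SHARE * Q_LEMON + (1 - LEMON_SHARE) * Q_PLUM
ceiling_uncertain = T_COMPANY * expected_q

loss = ceiling_known - ceiling_uncertain
loss_pct = 100 * loss / ceiling_known

print(f"Ceiling with perfect information: ${ceiling_known:,.0f}")
print(f"Ceiling with quality uncertainty: ${ceiling_uncertain:,.0f}")
print(f"Lost negotiating room: ${loss:,.0f} ({loss_pct:.0f}%)")
```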