Estimating linear regression models with a VERY LARGE number of fixed effects in R - Part 1/2

It is very common for researchers (or data scientists) not to have access to some of the (independent) variables that can potentially influence the dependent variable (e.g. market share, sales, choice of brand, etc.). These omitted variables can bias the estimates of marketing mix effects.

Fixed effects regression (i.e. dummy variable regression) is widely used to control for potential omitted variable bias. In typical panel data or multi-level data settings in Marketing and Economics, you can easily end up needing a VERY LARGE number of fixed effects to obtain consistent estimates. As an example, if you are trying to estimate price elasticities for each of 1,000 stores in 30 markets, belonging to the same or different retail chains, you might need to include store fixed effects (1,000) to control for potential endogeneity. In addition, you might be worried about potential correlation between unobserved market-level advertising and prices, so you may also need market-time fixed effects. Assuming weekly data for 2 years (i.e. 104 weeks), this results in 30 x 104 = 3,120 fixed effects. If we consider both store fixed effects and market-time fixed effects, a linear regression of unit sales of brands may need to control for 4,120 fixed effects with 100,000 observations (or sometimes with millions of observations).

Most common statistical software is not well suited to this type of estimation. As an example, Stata can only handle up to 11,000 variables (or dummies). Even with 1,000 dummies, estimation takes a very long time to finish.
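To make the arithmetic in the store example concrete, here is a minimal Python sketch. It is not part of the original analysis, and the variable names are purely illustrative. It counts the dummies and estimates the memory needed to hold just the dummy columns as a dense matrix, assuming 8-byte floating point entries.

```python
# Counts taken from the store example in the text.
n_stores = 1_000
n_markets = 30
n_weeks = 104            # weekly data for 2 years
n_obs = 100_000

store_fe = n_stores
market_week_fe = n_markets * n_weeks
n_fixed_effects = store_fe + market_week_fe

# Assumption: the dummies are stored as a dense matrix of 8-byte floats.
bytes_per_entry = 8
dense_bytes = n_obs * n_fixed_effects * bytes_per_entry

print(market_week_fe)      # 3120
print(n_fixed_effects)     # 4120
print(dense_bytes / 1e9)   # roughly 3.3 (GB)
```

Under the same assumption, a dataset with millions of observations would need tens of gigabytes or more for the dummy columns alone, which is why building the dummies explicitly quickly becomes impractical.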

An extension of the "within" transformation to more than one factor, called the "method of alternating projections", allows us to estimate linear regression models with a VERY LARGE number of dummies without creating the dummies explicitly. The lfe package in R (written by Simon Gaure and based on the method of alternating projections) is a handy tool for estimating such models. The package has been tested on datasets with 15 covariates and two factors with approximately 2,300,000 and 270,000 group levels. The estimation of the covariate effects (felm) took about 50 minutes on an 8-CPU machine. Recovery of the fixed effects (getfe) took 5 minutes.
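The idea behind alternating projections can be sketched in a few lines. The example below is a toy illustration in plain Python, not the lfe implementation, and every name and number in it is an assumption made for the sketch. It simulates data with a store factor and a market-week factor, where the true slope is 2.0 and the regressor is correlated with the store effect. It then demeans the outcome and the regressor by each factor in turn until nothing changes, and regresses the demeaned outcome on the demeaned regressor. By the Frisch-Waugh-Lovell theorem, this gives the same slope as the full dummy variable regression.

```python
import random

def demean_by(values, groups):
    """Subtract each group's mean from its members (one 'within' projection)."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def alternate_projections(values, factors, tol=1e-10, max_iter=1000):
    """Demean by each factor in turn until the values stop changing."""
    current = list(values)
    for _ in range(max_iter):
        previous = current
        for groups in factors:
            current = demean_by(current, groups)
        if max(abs(c - p) for c, p in zip(current, previous)) < tol:
            break
    return current

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Simulated (fake) data; all sizes and parameters are illustrative assumptions.
rng = random.Random(0)
n_obs, n_stores, n_market_weeks, true_beta = 2000, 50, 40, 2.0
store_effect = [rng.gauss(0, 1) for _ in range(n_stores)]
mw_effect = [rng.gauss(0, 1) for _ in range(n_market_weeks)]
store = [rng.randrange(n_stores) for _ in range(n_obs)]
market_week = [rng.randrange(n_market_weeks) for _ in range(n_obs)]

# x is correlated with the store effect, so omitting the effect biases OLS.
x = [store_effect[s] + rng.gauss(0, 1) for s in store]
y = [true_beta * xi + store_effect[s] + mw_effect[m] + rng.gauss(0, 0.1)
     for xi, s, m in zip(x, store, market_week)]

beta_naive = ols_slope(x, y)   # ignores both sets of fixed effects

factors = [store, market_week]
x_within = alternate_projections(x, factors)
y_within = alternate_projections(y, factors)
beta_fe = ols_slope(x_within, y_within)   # should land close to true_beta

print("naive OLS slope:    ", round(beta_naive, 3))
print("fixed effects slope:", round(beta_fe, 3))
```

No dummy matrix is ever built: each sweep only needs group sums and counts, so memory use grows with the number of observations and group levels rather than with their product.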

Below are some useful links for the lfe package (in R). In the next post, I will show test results for this package using fake data.

1. CRAN site to download R package (lfe)

2. lfe manual (R package)

3. R documentation (felm)

4. R documentation (getfe)
