UNDERSTANDING THE PROBLEM
Data science really is an endless road! After getting a feel for it by solving a few problem statements and reaching decent accuracy, I wanted to dive deeper into the field and learn how I could improve further. Realizing the importance of variance and correlation was a good lesson. In the TFI Restaurant Revenue Prediction problem, I was surprised to reach rank 6, though unfortunately the competition was no longer active.
Kaggle link: Restaurant Revenue Prediction
Problem Statement:
To predict the annual revenue of restaurants using the revenue of similar restaurants.
Luckily, the problem statement is simple here and the data is explained quite well.
The data:
The first surprise here is how small the training set is compared to the test set: using only 137 training points, we need to predict revenue for 100,000 test points. The features are the opening date, the type of restaurant, the city, the city group, and 37 demographic features.
Initial Observation
- There are no missing data. (Whooo!)
- The city names in the test and training sets do not fully match, so we would either need some feature engineering or have to ignore this column
- The demographic features, P1 to P37, may be highly correlated
- We need to predict a continuous variable from factors and numeric values, so checking whether there are outliers in the response variable is a good idea. (A box plot works well, check the Visualization tab; if there are outliers we can take the log of the revenue or drop the extreme values. See the sketch right after this list.)
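A minimal sketch of that check in R, assuming the training file sits at the same path used in the full script later in the post:

train <- read.csv("~/Downloads/train.csv")

# Box plot of the response to spot extreme revenues
boxplot(train$revenue, main = "Annual revenue", ylab = "Revenue")

# If the right tail is long, the log of revenue is much better behaved
hist(log(train$revenue), main = "log(revenue)", xlab = "log(revenue)")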
Feature Engineering
- Firstly, the most obvious thing to do here is to split the opening date: make new variables for the month, year and day.
- Subtract the opening date from a reference (maximum) date. How old a restaurant is could give good insight into its annual revenue.
- We can try PCA on the 37 demographic variables.
- For cities, the grouping by size is already given. Using data(world.cities) from the maps package, we could add latitude/longitude or even population. (A rough sketch of the last two ideas follows this list.)
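Here is a rough sketch of the PCA and city-lookup ideas; the exact PCA settings and the name matching are my assumptions for illustration, not the code behind the final submission (spelling differences between the dataset's city names and world.cities will leave some NAs):

library(maps)   # provides data(world.cities)

train <- read.csv("~/Downloads/train.csv")

# PCA on the 37 demographic columns P1..P37
pca <- prcomp(train[, paste0("P", 1:37)], center = TRUE, scale. = TRUE)
summary(pca)    # variance explained by the leading components

# Attach latitude, longitude and population where the city name matches
data(world.cities)
turkey <- subset(world.cities, country.etc == "Turkey")
train_geo <- merge(train, turkey[, c("name", "lat", "long", "pop")],
                   by.x = "City", by.y = "name", all.x = TRUE)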
Now let us visualize the data as much as possible and then try to develop a model for prediction.
EXPLORATORY DATA ANALYTICS
Now, let's visualize the data we have for the TFI Restaurant Revenue Prediction and see whether it really makes sense and whether we can gain some more insight.
- As previously mentioned, let us draw the box plot of revenue to check for outliers.
Since there are a few outliers, let us ignore the restaurants with revenue greater than 16,000,000.
- The 37 demographic features, P1 to P37, and a histogram plot for each of them
- We have introduced another feature: how long ago (in months) the restaurant was opened. Let's see if it has a clear relationship with revenue.
- Using a random forest we can see the relative importance of the variables in the initially given data and then build new features accordingly.
The plot shows how important the opening date is for this prediction, and we can understand that intuitively as well.
The next graph shows the error against the number of trees (estimators) used in the random forest. (A minimal version of both plots is sketched below.)
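A minimal version of those two plots, built on just the raw columns plus a simple age feature; the exact feature set behind the figures above is not reproduced here:

library(randomForest)

train <- read.csv("~/Downloads/train.csv")

# Age of the restaurant in days, then drop the columns the forest cannot use directly
train$days_open <- as.numeric(as.Date("2014-02-02") - as.Date(train$Open.Date, "%m/%d/%Y"))
train$Id <- train$Open.Date <- train$City <- NULL
train$City.Group <- as.factor(train$City.Group)
train$Type <- as.factor(train$Type)

set.seed(147)
fit <- randomForest(revenue ~ ., data = train, ntree = 1000)
varImpPlot(fit)   # relative importance of each variable
plot(fit)         # out-of-bag error as the number of trees grows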
PREDICTING REVENUE
Here is the code that reached rank 6 in the Kaggle TFI Restaurant Revenue Prediction. We have both factors and numeric variables, and we need to predict a continuous target. I had tried regression, the first method that comes to mind for predicting continuous values, but it did not work well. So here is the code using a random forest in R. It was totally unexpected to get such a rank without spending much time on it.
# Author: Himanshu Sikaria
# Title : Kaggle TFI Restaurant Revenue Prediction tutorial
# Model : Random Forest
library(randomForest)
library(dplyr)      # for the %>% pipe
library(reshape)    # for melt()
library(ggplot2)    # for the histogram and scatter plots

# Reading the data
train <- read.csv("~/Downloads/train.csv")
test <- read.csv("~/Downloads/test.csv")

# Removing outliers
train <- train[train$revenue < 16000000, ]

# Combining train and test
target <- train$revenue
train_row <- nrow(train)
train$revenue <- NULL
full <- rbind(train, test)

# Make sure the categorical columns are factors (in case they were read as character)
full$City.Group <- as.factor(full$City.Group)
full$Type <- as.factor(full$Type)

# Plotting a histogram for each of P1 to P37
d <- melt(train[, -c(1:5)])
ggplot(d, aes(x = value)) + facet_wrap(~ variable, scales = "free_x") + geom_histogram()

# Splitting the opening date
full$year  <- substr(as.character(full$Open.Date), 7, 10) %>% as.factor()
full$month <- substr(as.character(full$Open.Date), 1, 2) %>% as.factor()
full$day   <- substr(as.character(full$Open.Date), 4, 5) %>% as.numeric()
full$Date  <- as.Date(strptime(full$Open.Date, "%m/%d/%Y"))

# How old the restaurant is
full$days   <- as.numeric(as.Date("2014-02-02") - full$Date)
full$months <- as.numeric(as.Date("2014-02-02") - full$Date) / 30

# Removing columns which are not to be used
full$Id <- full$Open.Date <- full$Date <- full$City <- NULL

# Splitting back into train and test
train <- full[1:train_row, ]
train$revenue <- target
test <- full[(train_row + 1):nrow(full), ]
row.names(test) <- NULL

# Age of the restaurant (in months) vs revenue
qplot(months, revenue, data = train) + geom_smooth() + ggtitle("Life of restaurant (months) vs Revenue")

# Random Forest
set.seed(147)
fit <- randomForest(revenue ~ ., data = train, ntree = 1000)
varImpPlot(fit)
pred <- predict(fit, test, type = "response")

# Preparing the required output format
test <- read.csv("~/Downloads/test.csv")
final <- data.frame(ID = test$Id, Prediction = pred)
write.csv(final, "tfi_rf.csv", row.names = FALSE)
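One way to sanity-check the model without a separate validation set is the out-of-bag (OOB) error the forest already tracks. This is an add-on sketch on top of the script above, not part of the original submission; the mtry search via tuneRF is just one tuning idea:

# OOB predictions: predict() with no newdata returns them for a randomForest fit
oob_pred <- predict(fit)
oob_rmse <- sqrt(mean((train$revenue - oob_pred)^2))
oob_rmse        # the competition metric is RMSE on revenue

# Optional: search for a better mtry using the OOB error
tuned <- tuneRF(train[, setdiff(names(train), "revenue")], train$revenue,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)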
Those who do not remember the past are condemned to repeat it. – George Santayana
12 comments
Hello, just wanted to say, I enjoyed this post. It was inspiring. Keep on posting!
I really like it whenever people come together and share opinions. Great website, continue the good work!
Hey there! I could have sworn I’ve been to this site before but after reading through some of the post I realized it’s new to me. Nonetheless, I’m definitely delighted I found it and I’ll be book-marking and checking back often!
You actually make it seem really easy along with your presentation however I in finding this matter to be really one thing that I think I might by no means understand. It seems too complex and extremely large for me. I am taking a look ahead on your next submit, I’ll try to get the hold of it!
Thanks for the feedback! I am sure you can understand this problem if you put in a little time. I have tried to be as crisp as possible. 🙂
I just like the helpful information you provide in your articles. I will bookmark your blog and check again here regularly. I’m fairly certain I will be told a lot of new stuff right right here! Good luck for the following!
I was wondering if you ever thought of changing the structure of your site? Its very well written; I love what youve got to say. But maybe you could a little more in the way of content so people could connect with it better. Youve got an awful lot of text for only having 1 or two pictures. Maybe you could space it out better?
Thanks for the suggestion. I will keep it in mind for the next post!
Hey very nice blog!
Thanks for a clear explanation. Good work!
Hi, can you send me the code for following this TFI problem in machine learning, to be executed in a Jupyter notebook, along with its output, to the given email? Hope you will send it.
Hi Saiteja. I have updated the code for the TFI problem under the “Model Building” tab in the blog itself. Hope that works for you.