Trying my hand at Kaggle problems has added a new dimension to how I think about data. You are forced to see correlations between things you never imagined, and to brainstorm every possible option with a fixed set of data. After going through a few entry-level problems and brushing up my R skills, I tried my hand at a live competition. It took me a while just to figure out the problem statement and understand the data set for TalkingData Mobile User Demographics.


Problem Statement (trying to simplify it):

TalkingData is China's largest third-party data platform, covering almost 70% of the country's 500 million mobile users. In this competition, we need to predict the demographics (gender and age group) of mobile phone users from the data collected by TalkingData. Knowing the demographics of mobile users is crucial for many companies, as it helps them run highly targeted marketing and present content accordingly.

Understanding the data:

Keep the data-schema diagram from the competition page handy while attempting this problem; it's a lifesaver.

The train set contains device ids, gender, age and demographic group.
The test set contains only the device ids.

  • The type (brand, model) of mobile devices users use. phone_brand_device_model.csv
  • When and where (the time and date, latitude and longitude) a user uses his/her mobile; each such use is given an event id. So for example, every time you unlock your phone, the location and time are recorded by TalkingData under a new event id. events.csv
  • Corresponding to each event id, all the apps installed on the phone and all the apps that are active at that moment are recorded (each app has a unique app id). app_events.csv
  • The category of the apps. app_labels.csv and label_categories.csv
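As a quick sketch of loading these files: the snippet below writes a tiny stand-in for events.csv to a temp file so it runs anywhere; the real code simply points fread at the actual files listed above (the column names here are taken from the competition's data description).

```r
library(data.table)

# Tiny stand-in for events.csv, with the columns described above
tmp <- tempfile(fileext = ".csv")
writeLines(c("event_id,device_id,timestamp,longitude,latitude",
             "1,29182687948017175,2016-05-01 00:55:25,121.38,31.24",
             "2,-6401643145415154744,2016-05-01 00:54:12,103.65,30.97"),
           tmp)

# device_id is a 64-bit integer in the real files; reading it as
# character avoids the integer64/bit64 issue and keeps the ids exact
events <- fread(tmp, colClasses = c(device_id = "character"))
```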

Important Observations at first glance

  • The data set for phone brands contains ALL the device ids in the train and test sets. (Yeah!)
  • Only 31% of the device ids from the train and test sets appear in events.csv, which also means we have app details for only 31% of the devices. (This will make our task more difficult, and we need to choose a model which accounts for it.)
  • The device ids in the test set can have brands and model numbers which are not present in the train set. (This means we have new factor levels in test which are not in train, and our model needs to handle this too.)
    The above can be verified using %in% in R; taking the mean of the logical vector gives the fraction directly:

    mean(train$device_id %in% events$device_id)  # ~0.31 on the real data
  • Each device id can have more than one event id and each event id can have more than one app id and each app id can have more than one category.
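The second and third observations can be checked the same way; here is a toy sketch (the data frames are stand-ins, the real check uses the full train/test/events tables):

```r
# Toy stand-ins for the real train/test/events tables
train  <- data.frame(device_id = c("a", "b", "c"),
                     brand     = c("Xiaomi", "Huawei", "Xiaomi"))
test   <- data.frame(device_id = c("d", "e"),
                     brand     = c("Xiaomi", "vivo"))
events <- data.frame(device_id = c("a", "d", "d"))

# Fraction of devices that have any events (~0.31 on the real data)
all_ids  <- unique(c(train$device_id, test$device_id))
coverage <- mean(all_ids %in% events$device_id)

# Brands that appear in test but never in train (new factor levels)
new_brands <- setdiff(test$brand, train$brand)
```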

Feature Engineering

  • Number of apps in each event_id (for both active and installed apps)
  • Time of the day when the device is most used. It may be that if the device is not being used much from 10am to 4pm, the owner falls in the working-age group; similarly, teenagers may use it more at night.
  • A correlation can also be found by checking whether a user uses his/her device more during the weekends.
  • The number of events occurring each day. (maybe we can take the average of this)
  • The location might play a role, but this would require some cleaning: there are a lot of (0,0) pairs here which do not make much sense.
  • Depending on the model we use, we can group the categories of the apps. This should play a vital role. For example, young females might use more of the apps in the fashion and photography categories.
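The time-of-day and weekend features above can be sketched as follows (toy events table; the real one comes from events.csv, and the column names are assumptions):

```r
# Toy events table with a timestamp per event
events <- data.frame(
  device_id = c("a", "a", "a", "b"),
  timestamp = as.POSIXct(c("2016-05-01 23:10:00",   # Sunday night
                           "2016-05-02 22:30:00",   # Monday night
                           "2016-05-07 11:00:00",   # Saturday morning
                           "2016-05-03 09:15:00"))  # Tuesday morning
)

events$hour    <- as.integer(format(events$timestamp, "%H"))
events$weekend <- format(events$timestamp, "%u") %in% c("6", "7")  # Sat/Sun

# Per-device features: share of weekend events and most frequent hour
weekend_share <- aggregate(weekend ~ device_id, data = events, FUN = mean)
mode_hour <- sapply(split(events$hour, events$device_id),
                    function(h) as.integer(names(which.max(table(h)))))
```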

Now let us visualize the data as much as possible and then try to develop a model to predict.


Now, let’s visualize the data we have for TalkingData Mobile User Demographics and see if it really makes sense and whether we can get some more insights. (Not an expert at this; hopefully the visualizations will get better in future posts.)

  • Starting with the basic one, the bar chart for gender in the train set. Here we can see that in our train set 70% are male and 30% female; hence, if we plot anything else with respect to gender, the males will dominate.
  • Next, let’s look at the distribution of age in our train set.
  • Mobile brand and model number are going to be the deciding factors for our model, hence this graph of brand frequency against age group will be crucial. The graph is normalized, so it will not give a good idea of the absolute numbers, but luckily we can observe that there are a lot of brands which are gender- and age-specific.
  • Another thing to observe in the data is the location, i.e. latitude and longitude. There is a lot of cleaning to be done: one option is to replace the (0,0) pairs with the mean coordinates of the corresponding device id. In the world map, we can see that most of the points lie on land, which is good, and in the map of China there are dense clusters in the big cities. The map coloured by gender has far fewer data points than the other graphs (only the ~30% of train that appears in events.csv).

  • Two other features we had thought about were the average number of events per day, and the most and least used times of the day. The graphs show that these do make an impact, so including them should improve our model.

  • I tried to group the app categories, but grouping the categories and plotting them for the train set (only ~30% of the data available) does not show much improvement. So in our model, we should either try finer-grained groupings of the categories or include all the categories.
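The (0,0) clean-up mentioned above, replacing the placeholder coordinates with the per-device mean of the valid readings, could be sketched like this (toy data; devices with no valid reading, like "b" below, keep their zeros):

```r
library(data.table)

events <- data.table(
  device_id = c("a", "a", "a", "b"),
  longitude = c(121.4, 0, 103.6, 0),
  latitude  = c(31.2,  0, 31.0,  0)
)

# (0,0) is a placeholder, not a real location
events[, zero := longitude == 0 & latitude == 0]

# Within each device, replace placeholders with the mean of the valid fixes
events[, longitude := ifelse(zero & any(!zero), mean(longitude[!zero]), longitude),
       by = device_id]
events[, latitude  := ifelse(zero & any(!zero), mean(latitude[!zero]),  latitude),
       by = device_id]
```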


So one of the most important factors while building the model has to be the mobile brand and model number, and we would need to predict almost 70% of the devices using only these.
For the TalkingData Mobile User Demographics problem, random forest or plain regression would not be a good option, because of the large number of factor levels: R's randomForest can handle categorical predictors with at most 53 levels, and any new level in the test set would throw an error. So one of the models to consider is gradient boosting, which copes well with high-cardinality factors (use one-hot encoding), and moreover xgboost is very fast. Initially I tried it with only mobile brand and model, and after obtaining a good result I added the other variables, which further improved the score. Currently the score is ~2.26544 (multi-class log loss) with an accuracy of almost 96%, putting it in the top 20% (not the final result).
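A minimal sketch of this setup on synthetic data: one-hot encode the brand factor with sparse.model.matrix and train xgboost with the multi-class log-loss objective. The data, parameter values and the 3-class label are placeholders, not the tuned model (the real problem has 12 gender/age groups and also uses device model and the other features):

```r
library(xgboost)
library(Matrix)

set.seed(1)
n <- 200
d <- data.frame(
  brand = factor(sample(c("Xiaomi", "Huawei", "samsung"), n, replace = TRUE)),
  group = sample(0:2, n, replace = TRUE)   # 12 classes in the real problem
)

# One-hot encode the high-cardinality factor into a sparse matrix
X <- sparse.model.matrix(~ brand - 1, data = d)

bst <- xgboost(data = X, label = d$group,
               objective = "multi:softprob", num_class = 3,
               eval_metric = "mlogloss",
               nrounds = 10, max_depth = 4, eta = 0.3, verbose = 0)

# One probability column per class, one row per device
pred <- matrix(predict(bst, X), ncol = 3, byrow = TRUE)
```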

Some handy functions

  • %>% {magrittr, re-exported by dplyr}: Used for pipelining; can be very handy at times. Check the example below.
  • fread {data.table}: A very fast alternative to read.csv, but it creates a data.table. This can be converted to a data frame using the above pipelining method:

    events = fread("events.csv") %>% as.data.frame()
  • library bit64: The device_id values are integer64, for which this library is required; but you can instead read them as character directly in fread using the colClasses parameter:

    events = fread("events.csv", colClasses = c(device_id = "character"))
  • The following compute group-wise means: the first gives, for each unique device id, the mean of Day; the second (data.table syntax) gives the mean age for each brand/model pair.

    tmp = as.data.frame(sapply(split(events$Day, events$device_id), mean))
    tmp = data[, .(tmp = mean(age)), by = .(brand, model)]
  • cut {base}: This is very useful for binning a continuous numeric variable into factors. In our example, we can do it for latitude and longitude:

    d1$long_group = cut(d1$longitude,
                        breaks = c(-Inf, 80, 85, 90, Inf),
                        labels = c("long1", "long2", "long3", "long4"),
                        right = FALSE)
  • grepl {base}: Searches for a pattern and returns a logical vector (grep returns indices).

    uni_appcat$cat = ifelse(grepl("gambillard|puzz|poker|sport|shoot|rpg",
    			uni_appcat$category), "game", "Not_game")
  • paste0 can also be used to prefix values, e.g. d1$phone_brand = paste0("phone_brand ", d1$phone_brand) replaces the values of d1$phone_brand by adding "phone_brand" at the start.
  • Also look at some basic functions like merge, unique, match (alternative: %in%), aggregate, paste, paste0, ifelse


“Those who do not remember the past are condemned to repeat it.” – George Santayana