Dataset introduction:
train/test.csv
Column name                 Description                    Data type
date_time                   Timestamp                      string
site_name                   -                              int
posa_continent              -                              int
user_location_country       -                              int
user_location_region        -                              int
user_location_city          -                              int
orig_destination_distance   -                              double
user_id                     ID of user                     int
is_mobile                   -                              tinyint
is_package                  -                              int
channel                     ID of a marketing channel      int
srch_ci                     Checkin date                   string
srch_co                     Checkout date                  string
srch_adults_cnt             -                              int
srch_children_cnt           -                              int
srch_rm_cnt                 -                              int
srch_destination_id         -                              int
srch_destination_type_id    -                              int
hotel_continent             Hotel continent                int
hotel_country               Hotel country                  int
hotel_market                Hotel market                   int
is_booking                  1 if a booking, 0 if a click   tinyint
cnt                         -                              bigint
hotel_cluster               ID of a hotel cluster          int
destinations.csv
Column name                 Description                    Data type
srch_destination_id         -                              int
d1-d149                     -                              double
Feature selection:
To classify users, we train the model on the following features: {user_location_country;
user_location_region; user_location_city; user_id; channel; srch_adults_cnt; srch_children_cnt;
srch_rm_cnt; srch_destination_id; srch_destination_type_id; cnt}. These features help the model
place each user from the test file into the class of similar users.
For the hotel part: {hotel_continent; hotel_country; hotel_market; is_booking; hotel_cluster}.
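As a minimal sketch of this split, the snippet below separates one record into the user-side and hotel-side feature groups listed above. The column names come from the dataset table; the helper name `split_features` and the values in the sample row are made up for illustration only.

```python
import csv
import io

# Feature groups as listed in the text above.
USER_FEATURES = [
    "user_location_country", "user_location_region", "user_location_city",
    "user_id", "channel", "srch_adults_cnt", "srch_children_cnt",
    "srch_rm_cnt", "srch_destination_id", "srch_destination_type_id", "cnt",
]
HOTEL_FEATURES = [
    "hotel_continent", "hotel_country", "hotel_market",
    "is_booking", "hotel_cluster",
]


def split_features(row):
    """Return (user_part, hotel_part) as lists of ints for one CSV record.

    Hypothetical helper: columns that are absent from the row (for example
    hotel_cluster in a test file) are skipped rather than guessed.
    """
    user_part = [int(row[name]) for name in USER_FEATURES if name in row]
    hotel_part = [int(row[name]) for name in HOTEL_FEATURES if name in row]
    return user_part, hotel_part


# Made-up sample record, only to show the shape of the output.
sample_csv = io.StringIO(
    "user_location_country,user_location_region,user_location_city,user_id,"
    "channel,srch_adults_cnt,srch_children_cnt,srch_rm_cnt,"
    "srch_destination_id,srch_destination_type_id,cnt,"
    "hotel_continent,hotel_country,hotel_market,is_booking,hotel_cluster\n"
    "66,348,48862,12,9,2,0,1,8250,1,3,2,50,628,0,1\n"
)
for record in csv.DictReader(sample_csv):
    user_part, hotel_part = split_features(record)
    print(user_part, hotel_part)
```

With the real files, `sample_csv` would be replaced by an open handle on train.csv or test.csv.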
Method:
In this project, we decided to use a Support Vector Machine (SVM), which is based on statistical
learning theory and applies the principle of Structural Risk Minimization instead of Empirical
Risk Minimization [1]. An SVM finds the maximal-margin separating hyperplane between two
classes of data. In our case, one class is the user group and the other is the hotel clusters.
Implementing the SVM involves two components: mathematical programming and kernel functions
(linear or non-linear). Based on the materials from the class, the non-linear functions we
can use for the recommendation are logistic regression, ReLU, or the sigmoid function.
Linear SVM:
$$\min_{\vec{w},\, b,\, \xi \ge 0} \; \frac{1}{2}\|\vec{w}\|^2 + C \sum_{i=1}^{m} \xi_i$$
such that
$$y_i(\vec{w} \cdot \vec{x}_i - b) + \xi_i \ge 1$$
where $\xi_i$ is the error for a given training point $\vec{x}_i$, $\vec{w}$ defines the
best separating hyperplane, $b$ is the offset for that hyperplane, and $C$ is a constant that represents
the emphasis that is to be placed on minimizing the error [1].
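To make the objective concrete, here is a small sketch that evaluates it for a fixed hyperplane. It assumes the soft-margin form with the constraint y_i (w . x_i - b) + xi_i >= 1, so the smallest admissible error for each point is max(0, 1 - y_i (w . x_i - b)). The function names and the toy data are invented for illustration, and the snippet only evaluates the objective; it does not solve the minimization.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 that satisfies y * (w . x - b) + xi >= 1."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))


def soft_margin_objective(w, b, points, labels, C):
    """Evaluate 0.5 * ||w||^2 + C * (sum of the slacks)."""
    total_slack = sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return 0.5 * dot(w, w) + C * total_slack


# Toy 2-D data (made up): the last point sits inside the margin.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (2.0, 0.5)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), 3.0

print([slack(w, b, x, y) for x, y in zip(points, labels)])  # [0.0, 0.0, 0.0, 0.5]
print(soft_margin_objective(w, b, points, labels, C=1.0))   # 1.5
```

A larger C makes the same margin violation more expensive, which is the "emphasis on minimizing the error" described above.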
Once this problem has been solved, the decision function can be written as follows:
$$f(\vec{x}) = \operatorname{sign}\left[\sum_{i=1}^{m} a_i y_i K(\vec{x}_i, \vec{x}) - b\right]$$
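The decision function can be sketched in a few lines. The version below assumes the form sign(sum_i a_i y_i K(x_i, x) - b), with the kernel K evaluated between each training point x_i and the query point x. The names `decision` and `linear_kernel` are illustrative, and a score of exactly zero is mapped to +1 as an arbitrary convention. The toy coefficients are the solution for a two-point problem, not values learned from the hotel data.

```python
def linear_kernel(u, v):
    """Plain dot product, i.e. the linear SVM case."""
    return sum(a * b for a, b in zip(u, v))


def decision(x, points, labels, coeffs, b, kernel=linear_kernel):
    """Return sign(sum_i a_i * y_i * K(x_i, x) - b) as +1 or -1."""
    score = sum(
        a * y * kernel(xi, x)
        for a, y, xi in zip(coeffs, labels, points)
    ) - b
    return 1 if score >= 0 else -1


# Two training points, one per class; a_i = 1 and b = 3 give w = (1, 1).
points = [(2.0, 2.0), (1.0, 1.0)]
labels = [1, -1]
coeffs = [1.0, 1.0]
b = 3.0

print(decision((3.0, 3.0), points, labels, coeffs, b))  # 1
print(decision((0.0, 0.0), points, labels, coeffs, b))  # -1
```

Passing a different `kernel` argument changes the classifier without touching the rest of the function.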
In this case, the linear kernel can be replaced by a non-linear kernel. The most popular one is:
$$K(\vec{x}_i, \vec{x}_j) = e^{-\gamma \|\vec{x}_i - \vec{x}_j\|^2}$$
where $\gamma$ is a user-chosen parameter [1]. $f(\vec{x})$ can serve as the score for each hotel [2]. From the
training file, we build a classifier based on the users' features. We then use that classifier to
assign each user from the test file with similar features to the hotel that has the highest score.
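One possible reading of this scoring step is sketched below. It assumes the Gaussian kernel exp(-gamma * ||u - v||^2) with gamma as the user-chosen parameter, one classifier per hotel cluster, and the decision value before taking the sign as the score. The text does not spell out any of these choices, so treat them as assumptions. The function names, the `models` layout, and all numbers are made up.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def cluster_score(x, model, gamma):
    """Decision value sum_i a_i * y_i * K(x_i, x) - b, before taking the sign."""
    total = sum(
        a * y * rbf_kernel(xi, x, gamma)
        for a, y, xi in zip(model["coeffs"], model["labels"], model["points"])
    )
    return total - model["b"]


def best_cluster(x, models, gamma):
    """Pick the hotel cluster whose classifier gives the highest score."""
    return max(models, key=lambda cluster: cluster_score(x, models[cluster], gamma))


# Two hypothetical per-cluster classifiers keyed by hotel_cluster ID.
models = {
    10: {"points": [(0.0, 0.0), (4.0, 4.0)], "labels": [1, -1],
         "coeffs": [1.0, 1.0], "b": 0.0},
    20: {"points": [(4.0, 4.0), (0.0, 0.0)], "labels": [1, -1],
         "coeffs": [1.0, 1.0], "b": 0.0},
}

print(best_cluster((0.5, 0.0), models, gamma=0.5))  # 10
print(best_cluster((4.0, 3.5), models, gamma=0.5))  # 20
```

A test-file user whose feature vector lies close to the positive training points of a cluster receives a high score for that cluster, which matches the "similar features" idea in the paragraph above.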
[1] Xu, J. A., & Araki, K. (2006). A SVM-based Personal Recommendation System for TV Programs. 2006
12th International Multi-Media Modelling Conference. doi:10.1109/mmmc.2006.1651358
[2] Gupta, A., Jain, R., & Song, S. Movie Recommendations Using Social Networks.