Dataset introduction:
train/test.csv
Column name                 Data type  Description
date_time                   string     Timestamp
site_name                   int        ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...)
posa_continent              int
user_location_country       int
user_location_region        int
user_location_city          int
orig_destination_distance   double
user_id                     int        ID of user
is_mobile                   tinyint
is_package                  int
channel                     int        ID of a marketing channel
srch_ci                     string     Checkin date
srch_co                     string     Checkout date
srch_adults_cnt             int
srch_children_cnt           int
srch_rm_cnt                 int
srch_destination_id         int
srch_destination_type_id    int
hotel_continent             int        Hotel continent
hotel_country               int        Hotel country
hotel_market                int        Hotel market
is_booking                  tinyint    1 if a booking, 0 if a click
cnt                         bigint
hotel_cluster               int        ID of a hotel cluster
destinations.csv
Column name                 Data type  Description
srch_destination_id         int
d1-d149                     double
Feature selection:
To classify users and hotels, we need to train the model on the following features:
1. User location:
- user_location_country
- user_location_region
- user_location_city
- orig_destination_distance
The first three features describe the area where a customer lives. We plan to use them to
distinguish users' locations from each other. We will not use orig_destination_distance because
this value is sometimes unavailable, which would increase the uncertainty of our predicted
results.
2. User information:
- user_id
- is_mobile
- is_package
- channel
user_id identifies each user. We will not include is_mobile in our features, because we assume
that the device used for booking does not affect a person's choice. is_package will not be used
either, because we assume that booking a hotel together with a flight has little effect on a
user's decision. channel will be left out because we are not sure what it means.
3. Booking information:
- srch_ci
- srch_co
- srch_adults_cnt
- srch_children_cnt
- srch_rm_cnt
- srch_destination_id
- srch_destination_type_id
- cnt
We will compute the expected length of stay by subtracting srch_ci from srch_co and name the
result srch_length. We plan to use srch_adults_cnt, srch_rm_cnt, srch_destination_id,
srch_destination_type_id and cnt to characterize each booking. srch_children_cnt will be reduced
to a binary value instead of the number of children: 1 if any children are present and 0 if not.
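The two derived fields described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes srch_ci and srch_co are ISO-formatted strings (YYYY-MM-DD) and that both are present, and the names add_derived_features and has_children are hypothetical.

```python
from datetime import date


def add_derived_features(row):
    """Return a copy of `row` with srch_length and has_children added.

    Assumes srch_ci / srch_co are 'YYYY-MM-DD' strings (an assumption about
    the file format) and does not handle missing dates.
    """
    out = dict(row)
    check_in = date.fromisoformat(row["srch_ci"])
    check_out = date.fromisoformat(row["srch_co"])
    # Length of stay in days: srch_co minus srch_ci.
    out["srch_length"] = (check_out - check_in).days
    # Binary flag replacing the raw child count (hypothetical field name).
    out["has_children"] = 1 if int(row["srch_children_cnt"]) > 0 else 0
    return out


example = add_derived_features(
    {"srch_ci": "2014-08-27", "srch_co": "2014-08-31", "srch_children_cnt": "2"}
)
print(example["srch_length"], example["has_children"])  # prints: 4 1
```

Rows with a missing check-in or check-out date would need separate handling, for example by skipping them or assigning a default length.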
4. Hotel information:
- hotel_continent
- hotel_country
- hotel_market
All three of these fields will be used to categorize a hotel's location.
5. Decision information:
- is_booking
- hotel_cluster
The learning targets are based on these two fields.
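As a rough sketch of how the selection above could be applied to one parsed CSV row, the snippet below keeps only the raw columns the text says will be used and separates the two decision fields. The names SELECTED_FEATURES and to_example are hypothetical, and treating user_id as an identifier rather than a numeric input is an assumption.

```python
# Raw columns kept as model inputs according to the selection above.
# user_id is treated as an identifier, not a numeric input (assumption).
SELECTED_FEATURES = [
    "user_location_country", "user_location_region", "user_location_city",
    "srch_adults_cnt", "srch_rm_cnt", "srch_destination_id",
    "srch_destination_type_id", "cnt",
    "hotel_continent", "hotel_country", "hotel_market",
]


def to_example(row):
    """Split one row (a dict of strings) into (features, is_booking, hotel_cluster)."""
    features = [int(row[name]) for name in SELECTED_FEATURES]
    return features, int(row["is_booking"]), int(row["hotel_cluster"])


# Made-up row for illustration only; the values are not from the dataset.
sample_row = {name: "1" for name in SELECTED_FEATURES}
sample_row.update({"is_mobile": "0", "is_package": "1", "channel": "9",
                   "is_booking": "1", "hotel_cluster": "42"})
features, is_booking, hotel_cluster = to_example(sample_row)
print(len(features), is_booking, hotel_cluster)  # prints: 11 1 42
```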
Method:
Several methods can address this recommendation problem, including SVD and collaborative
filtering. The most straightforward approach is SVM. In this project, we plan to use a Support
Vector Machine (SVM), which is based on statistical learning theory and applies the principle of
Structural Risk Minimization instead of Empirical Risk Minimization [1]. An SVM finds the
maximal-margin separating hyperplane between two classes of data. In our case, one class is the
user group and the other is the hotel clusters.
If time remains after this method is in place, we would like to implement SVD or collaborative
filtering as well and compare running time and accuracy.
Implementing the SVM involves two components: mathematical programming and kernel functions
(linear or non-linear).
SVM Model:
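$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert_2^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\,(w\cdot x_i - b) + \xi_i \ge 1,\quad \xi_i \ge 0,\quad i = 1,\dots,n$$
where $\xi_i$ is the error for a given training point $x_i$, $w$ is the vector of coefficients for
the best separating hyperplane, $b$ is the offset for that hyperplane, and $C$ is a constant that
represents the emphasis that is to be placed on minimizing the error [1].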
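To make the roles of the coefficient vector, the offset, the per-point error, and C concrete, here is a small sketch that evaluates a soft-margin objective of the form (1/2)||w||^2 + C * sum(xi_i) for a fixed hyperplane. It assumes labels in {-1, +1} and the sign convention w.x - b, and it only evaluates the objective; it does not solve the optimization problem. The function names and the toy data are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w.x - b) + xi >= 1."""
    return max(0.0, 1.0 - y * (dot(w, x) - b))


def primal_objective(w, b, C, points, labels):
    """Value of 0.5 * ||w||^2 + C * sum of slacks for a given (w, b)."""
    margin_term = 0.5 * dot(w, w)
    error_term = C * sum(slack(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + error_term


# Toy data: the third point lies on the wrong side of the hyperplane.
points = [(2.0, 0.0), (0.0, 0.0), (1.5, 0.0)]
labels = [1, -1, -1]
value = primal_objective(w=(1.0, 0.0), b=1.0, C=10.0, points=points, labels=labels)
print(value)  # prints: 15.5
```

A larger C makes the misclassified point more expensive relative to the margin term, which is the trade-off the constant controls.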
Once this problem has been solved, the decision function can be written as follows:
$$f(x) = \sum_{i=1}^{n}\left[\alpha_i\, y_i\, K(x_i, x)\right] - b$$
where the $\alpha_i$ are the multipliers obtained from the solution. The kernel $K$ can be
replaced by a non-linear kernel. The most popular choice is the Gaussian kernel:
$$K(x_i, x) = \exp\!\left(-\gamma\,\lVert x_i - x\rVert^2\right)$$
where $\gamma$ is a user-chosen parameter [1]. $f(x)$ can serve as the score for each hotel [2].
Using the training file, we can build the classifier from the users' features, and then use that
classifier to assign each user in the test file who has similar features to the hotel with the
highest score.
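The scoring step described above can be sketched as follows, assuming a Gaussian (RBF) kernel with parameter gamma, a decision value of the form sum(alpha_i * y_i * K(x_i, x)) - b, and one already-trained model per hotel cluster (a one-vs-rest setup, which is an assumption). The support vectors, multipliers, and cluster IDs below are hand-picked toy values, not trained results.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """Score f(x) = sum_i alpha_i * y_i * K(x_i, x) - b."""
    total = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return total - b


def best_cluster(x, models, gamma):
    """Return the cluster whose model scores x highest.

    `models` maps a cluster id to (support_vectors, labels, alphas, b).
    """
    scores = {cid: decision_value(x, *params, gamma=gamma)
              for cid, params in models.items()}
    return max(scores, key=scores.get)


# Toy models with a single support vector each (illustrative values only).
models = {
    7: ([(0.0, 0.0)], [1], [1.0], 0.0),
    42: ([(3.0, 3.0)], [1], [1.0], 0.0),
}
print(best_cluster((0.2, 0.1), models, gamma=0.5))  # prints: 7
```

The user at (0.2, 0.1) is much closer to the support vector of cluster 7, so that model returns the higher score.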
[1] Xu, J. A., & Araki, K. (2006). A SVM-based Personal Recommendation System for TV
Programs. 2006 12th International Multi-Media Modelling Conference. doi:10.1109/mmmc.2006.1651358
[2] Ankit Gupta, Rohan Jain, Shiwei Song. Movie Recommendations Using Social Networks.