
Expedia Hotel Recommendations

Qingxuan Li, Bing Zhang


Department of Electrical and Computer Engineering
University of California, Davis
Introduction:
Recommendation systems are a typical machine learning problem. In this project, Expedia wants
to use a recommendation system to provide personalized hotel suggestions to its users. Expedia
has published a dataset of customer activity and asks participants to contextualize the customer
data and predict the likelihood that a user will stay at each of 100 different hotel groups.

Dataset introduction:
train/test.csv

Column name                 Data type   Description
--------------------------  ---------   -----------
date_time                   string      Timestamp
site_name                   int         ID of the Expedia point of sale (i.e. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...)
posa_continent              int         ID of continent associated with site_name
user_location_country       int         The ID of the country the customer is located
user_location_region        int         The ID of the region the customer is located
user_location_city          int         The ID of the city the customer is located
orig_destination_distance   double      Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated
user_id                     int         ID of user
is_mobile                   tinyint     1 when a user connected from a mobile device, 0 otherwise
is_package                  int         1 if the click/booking was generated as a part of a package (i.e. combined with a flight), 0 otherwise
channel                     int         ID of a marketing channel
srch_ci                     string      Checkin date
srch_co                     string      Checkout date
srch_adults_cnt             int         The number of adults specified in the hotel room
srch_children_cnt           int         The number of (extra occupancy) children specified in the hotel room
srch_rm_cnt                 int         The number of hotel rooms specified in the search
srch_destination_id         int         ID of the destination where the hotel search was performed
srch_destination_type_id    int         Type of destination
hotel_continent             int         Hotel continent
hotel_country               int         Hotel country
hotel_market                int         Hotel market
is_booking                  tinyint     1 if a booking, 0 if a click
cnt                         bigint      Number of similar events in the context of the same user session
hotel_cluster               int         ID of a hotel cluster

destinations.csv

Column name                 Data type   Description
--------------------------  ---------   -----------
srch_destination_id         int         ID of the destination where the hotel search was performed
d1-d149                     double      Latent description of search regions

Feature selection:
To classify users and hotels, we train the model on the following features:
1. User location:
- user_location_country
- user_location_region
- user_location_city
- orig_destination_distance
The first three features describe where a customer lives, and we plan to use them to distinguish
users by location. We will not use orig_destination_distance because it is sometimes null (the
distance could not be calculated), which would increase the uncertainty of our predicted
results.
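Before dropping orig_destination_distance, it may be worth measuring how often it is actually missing. The following is a minimal sketch using only the Python standard library. It assumes that null distances appear as empty fields in the CSV; the helper name missing_fraction and the four sample rows are made up for illustration and are not taken from the real data.

```python
import csv
import io


def missing_fraction(rows, column):
    """Return the share of rows whose value in `column` is an empty string."""
    total = 0
    missing = 0
    for row in rows:
        total += 1
        if row[column].strip() == "":
            missing += 1
    return missing / total if total else 0.0


# Tiny made-up sample standing in for train.csv (assumption: null = empty field).
sample = io.StringIO(
    "user_id,orig_destination_distance\n"
    "1,2234.26\n"
    "2,\n"
    "3,913.19\n"
    "4,\n"
)
print(missing_fraction(csv.DictReader(sample), "orig_destination_distance"))
```

For the real file, the same function could be fed csv.DictReader(open("train.csv", newline="")), which streams the rows instead of loading them all into memory.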
2. User information:
- user_id
- is_mobile
- is_package
- channel
user_id identifies each user. We will not include is_mobile among our features, because we
assume that the device used for booking does not affect a person's choice of hotel. We will also
leave out is_package, because we assume that booking a hotel together with a flight has little
effect on a user's decision. We will not use channel because we are not sure about its meaning.
3. Booking information:
- srch_ci
- srch_co
- srch_adults_cnt
- srch_children_cnt
- srch_rm_cnt
- srch_destination_id
- srch_destination_type_id
- cnt
We will calculate the expected length of stay by subtracting srch_ci from srch_co and name the
result srch_length. We plan to use srch_adults_cnt, srch_rm_cnt, srch_destination_id,
srch_destination_type_id, and cnt to characterize each booking. srch_children_cnt will be
encoded as 1 or 0 rather than as a count: 1 if any children are present and 0 otherwise.
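The two derived booking features described above could be computed roughly as follows. This is only a sketch: it assumes that srch_ci and srch_co are ISO-formatted "YYYY-MM-DD" strings, and the names booking_features and has_children are hypothetical, introduced here for illustration.

```python
from datetime import date


def booking_features(row):
    """Derive the length of stay and a binary children flag from one search record."""
    # Assumption: check-in and check-out dates are "YYYY-MM-DD" strings.
    check_in = date.fromisoformat(row["srch_ci"])
    check_out = date.fromisoformat(row["srch_co"])
    return {
        # srch_length = srch_co - srch_ci, in days
        "srch_length": (check_out - check_in).days,
        # 1 if at least one child is present, 0 otherwise
        "has_children": 1 if int(row["srch_children_cnt"]) > 0 else 0,
    }


# Made-up example record
example = {"srch_ci": "2014-08-27", "srch_co": "2014-08-31", "srch_children_cnt": "2"}
print(booking_features(example))  # {'srch_length': 4, 'has_children': 1}
```

Rows with missing or malformed dates would raise a ValueError here and would need to be filtered or imputed first.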
4. Hotel information:
- hotel_continent
- hotel_country
- hotel_market
All three features will be used to categorize the hotel's location.
5. Decision information:
- is_booking
- hotel_cluster
The learning targets are based on these two fields.
Method:
Several methods can address recommendation problems, such as singular value decomposition (SVD)
and collaborative filtering. The most straightforward approach for us is the Support Vector
Machine (SVM), which we plan to use in this project. SVM is based on statistical learning
theory and uses the principle of Structural Risk Minimization instead of Empirical Risk
Minimization [1]. An SVM finds the maximal-margin separating hyperplane between two classes of
data. In our case, one class is the user group and the other is the hotel cluster.
If time permits after implementing this method, we would like to implement SVD or collaborative
filtering and compare running time and accuracy.
Implementing the SVM involves two components: mathematical programming and the kernel function
(linear or non-linear).
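SVM Model:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \xi_i \ge 0, \quad y_i\,(w \cdot x_i - b) + \xi_i \ge 1$$

where $\xi_i$ is the error for a given training point $x_i$, $y_i \in \{-1, +1\}$ is its class
label, $w$ is the vector of coefficients for the best separating hyperplane, $b$ is the offset
for that hyperplane, and $C$ is a constant that represents the emphasis that is to be placed on
minimizing the error [1].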
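To make the roles of the error terms and of C concrete, the sketch below evaluates the soft-margin objective for a fixed hyperplane. It assumes the constraint has the form y_i (w . x_i - b) + xi_i >= 1, so the smallest feasible error for each point is max(0, 1 - y_i (w . x_i - b)). The function name svm_primal_objective and the three sample points are made up for illustration; this evaluates the objective only and does not solve the optimization problem.

```python
def svm_primal_objective(w, b, C, points, labels):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi_i) using the smallest feasible slacks."""
    slacks = []
    for x, y in zip(points, labels):
        # Signed margin of the point with respect to the hyperplane w . x - b = 0
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) - b)
        # xi_i is zero when the margin constraint already holds
        slacks.append(max(0.0, 1.0 - margin))
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Made-up data: the third point lies on the wrong side of the hyperplane.
points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, -1]
objective, slacks = svm_primal_objective(
    w=(1.0, 0.0), b=0.0, C=10.0, points=points, labels=labels
)
print(objective, slacks)  # 15.5 [0.0, 0.0, 1.5]
```

A larger C makes the misclassified third point more expensive, which is the sense in which C controls the emphasis on minimizing the error.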
Once this problem has been solved, the decision function can be written as follows:

$$f(x) = \sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) - b$$

where $\alpha_i$ are the coefficients obtained from the dual problem and $y_i$ are the class
labels. The kernel $K$ can be replaced with a non-linear kernel; the most popular one is:

$$K(x_i, x) = \exp\left(-\gamma\,\|x_i - x\|^2\right)$$

where $\gamma$ is a user-chosen parameter [1]. $f(x)$ can serve as the score for each hotel [2].
From the training file, we build the classifier on the users' features, and then use that
classifier to assign each user from the test file who has similar features to the hotel with
the highest score.
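The scoring step can be illustrated with a small sketch. It assumes the Gaussian (RBF) form K(x_i, x) = exp(-gamma * ||x_i - x||^2) for the kernel and one decision function per hotel cluster, with the user assigned to the cluster whose score is highest. The one-function-per-cluster layout, the function names, and the support vectors and coefficients below are hypothetical placeholders, not values produced by training.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def decision_score(x, support_vectors, coeffs, offset, gamma):
    """f(x) = sum_i coeff_i * K(x_i, x) - offset, with coeff_i = alpha_i * y_i."""
    total = sum(c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coeffs))
    return total - offset


def best_cluster(x, models, gamma):
    """Score x against every per-cluster model and return the top cluster ID."""
    scores = {
        cluster_id: decision_score(x, m["sv"], m["coeffs"], m["offset"], gamma)
        for cluster_id, m in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical models for two hotel clusters (IDs 7 and 42).
models = {
    7: {"sv": [(0.0, 0.0)], "coeffs": [1.0], "offset": 0.0},
    42: {"sv": [(3.0, 4.0)], "coeffs": [1.0], "offset": 0.0},
}
cluster, scores = best_cluster((0.0, 0.0), models, gamma=0.1)
print(cluster)  # 7
```

Because the kernel decays with distance, a user whose feature vector is close to the support vectors of a cluster receives a high score for that cluster.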

[1] Xu, J. A., & Araki, K. (2006). A SVM-based Personal Recommendation System for TV Programs.
2006 12th International Multi-Media Modelling Conference. doi:10.1109/mmmc.2006.1651358
[2] Gupta, A., Jain, R., & Song, S. Movie Recommendations Using Social Networks.