Cruise Ship crew size analysis

Predict cruise ship crew size based on ship characteristics.
Author: Niclas Kjäll-Ohlsson (niclasko@gmail.com)

Import libraries

Read data

Simple statistics over data

One-hot encode categorical variable Cruise_line

Here we create one column per value of Cruise_line, where 1 indicates that an observation has the corresponding Cruise_line value and 0 that it does not. One possible use of this one-hot encoding is to model interaction terms between numerical variables and Cruise_line, e.g. Orient*Tonnage, in order to capture the distribution of a numerical variable conditioned on Cruise_line. A minimal sketch follows.
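A sketch of this encoding with pandas, assuming the data is loaded into a DataFrame `df` with a `Cruise_line` column (the interaction column name is illustrative):

```python
import pandas as pd

# One column per Cruise_line value; 1 if the observation has that value, else 0.
dummies = pd.get_dummies(df["Cruise_line"], prefix="Cruise_line")
df = pd.concat([df, dummies], axis=1)

# Illustrative interaction term: captures Tonnage conditioned on Cruise_line Orient.
df["Orient_x_Tonnage"] = df["Cruise_line_Orient"] * df["Tonnage"]
```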

Exploratory Data Analysis

Target variable analysis

We can see that the target variable crew size is fairly normally distributed, with a mode of 9.2 and a mean of around 7.7, and a slight skew towards lower values. There are two outliers at around 20, both well above the 95th percentile (~12.5).
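The quoted statistics can be reproduced directly from the DataFrame (a sketch, assuming the target column is named `crew`):

```python
# Summary statistics for the target variable.
print(df["crew"].mean())           # ~7.7
print(df["crew"].quantile(0.95))   # ~12.5
df["crew"].plot.hist(bins=30)      # the mode (~9.2) shows as the tallest bin
```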

Independent variable analysis

Looking at the Pearson correlation between the numerical variables (Tonnage, passengers, length, cabins, passenger_density and crew), we can see a strong positive correlation between all variables except passenger_density. Age shows a medium negative correlation with crew size. All in all, we have an indication of linearity between the variables, except for passenger_density, which shows a weaker and negative correlation signal. Variables that correlate strongly with the target are good candidate predictors in a regression model, and will likely predict crew size well.

Pearson correlation is defined as:

$\rho_{X,Y} = \frac{\text{cov}(X,Y)}{\sigma_X \sigma_Y}$, where
$\text{cov}(X,Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$ and
$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ and
$\sigma_X = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
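This definition translates directly to NumPy (a minimal sketch; pandas' `DataFrame.corr()` computes the full Pearson correlation matrix in one call):

```python
import numpy as np

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    # Pearson correlation following the definition above (the 1/N
    # normalization cancels between covariance and standard deviations).
    cov = ((x - x.mean()) * (y - y.mean())).mean()
    return cov / (x.std() * y.std())
```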

How to interpret Pearson correlation (source: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient):

The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.

passenger_density is the variable with the lowest correlation with the target variable crew size, as well as with the other variables. It is therefore the noisiest variable and a candidate for removal in the modeling below.

Next, looking at the variables with the strongest correlation to the target variable crew size, we can see below that cabins, Tonnage, passengers and length have a strong positive correlation with crew size, while passenger_density has a weak negative correlation.

Clustering analysis

Next we perform a singular value decomposition (SVD) matrix factorisation in order to cluster the data in 2 dimensions. We project the numerical variables (excluding the target variable crew) onto the two components (variable weightings) that explain most of the variance in the dataset.

The SVD projections are then clustered using the K-means algorithm. We can see in the plot below that data points that are close in the latent 2D SVD space are also similar in data space (hover over points to see the actual data). A sketch of the pipeline follows.
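A sketch of this pipeline with scikit-learn, using the column names from above (the standardisation step and the cluster count are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

num_cols = ["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density"]
X = StandardScaler().fit_transform(df[num_cols])

# Project onto the 2 components that explain the most variance.
svd = TruncatedSVD(n_components=2, random_state=0)
proj = svd.fit_transform(X)                  # columns: svd_X, svd_Y
print(svd.explained_variance_ratio_.sum())   # sum of explained variance
print(svd.components_)                       # variable weights per component

# Cluster the 2D projections.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(proj)
```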

We can see that the two SVD components (svd_X and svd_Y) explain almost all of the variance in the independent variables over the dataset (sum of explained variance 0.954006).

Below we can see the variable weights for the 2 SVD components (svd_X and svd_Y) onto which the data is projected.

Data points that are close in the latent, SVD-projected 2D space are also close in data space; hover over the data points in the plot below to verify.

Modeling

Here we build a linear model with L2 regularisation (Ridge regression) to predict crew size from the independent variables. The steps are described below.

The objective of Ridge regression is to minimise the sum of squared errors between actual and predicted values while simultaneously minimising the sum of squared regression coefficients. By doing so, Ridge regression strikes a balance between under- and over-fitting; it effectively seeks the best trade-off between bias (underfitting) and variance (overfitting). The Ridge regression objective function is:
$$\min_{\beta}\left(\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p}x_{i,j}\beta_{j}\Big)^2+\lambda\sum_{j=1}^{p}\beta_{j}^2\right)$$
where $n$ is the number of observations in the data set, $p$ is the number of model parameters (variables), $\beta$ is the vector of model coefficients and $\lambda$ is the regularisation penalty term. In Ridge regression implementations such as scikit-learn's, $\lambda$ is called alpha.
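For reference, a minimal Ridge fit with scikit-learn, where the `alpha` argument plays the role of $\lambda$ (the value and the train-split variable names are illustrative):

```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)       # alpha = lambda in the objective above
model.fit(X_train, y_train)    # X_train/y_train assumed from a train/test split
print(model.coef_, model.intercept_)
```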

Overfitting in a linear model can be caused by coefficients attaining large magnitudes (i.e. high variance), especially in models with many variables. The Ridge objective seeks to shrink coefficients (in both the positive and negative directions, hence the square) while also minimising model error. The objective function thus forces the learning algorithm to learn the simplest model with the best performance (minimum error); simpler models are more likely to generalise well to unseen data.

We perform a grid search for the L2 regularisation parameter in Ridge regression. For each parameter value we perform 5-fold cross-validation over the training data and record the mean RMSE, Pearson and $R^2$ scores. The best L2 parameter (alpha) is selected as the value where the distance between the mean Pearson scores for train and validation is smallest. We thereby seek a good estimate of the balance between under- and over-fitting. Alpha is the hyper-parameter that is tuned below in order to improve the generalisability of the model, as sketched next.
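A sketch of this selection procedure, assuming NumPy arrays `X_train` and `y_train` from a train/test split (the alpha grid is an assumption):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

alphas = np.logspace(-3, 3, 25)                      # assumed search grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

gaps = []
for alpha in alphas:
    train_r, val_r = [], []
    for tr, va in cv.split(X_train):
        model = Ridge(alpha=alpha).fit(X_train[tr], y_train[tr])
        train_r.append(pearsonr(y_train[tr], model.predict(X_train[tr]))[0])
        val_r.append(pearsonr(y_train[va], model.predict(X_train[va]))[0])
    # Distance between mean train and validation Pearson scores.
    gaps.append(abs(np.mean(train_r) - np.mean(val_r)))

best_alpha = alphas[int(np.argmin(gaps))]
```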

Pearson correlation is defined above.

$R^2$ is defined as $R^2 = 1-\frac{\sum_{i=1}^{n}(y_i-f_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$, where $f$ is the model prediction, i.e. one minus the ratio of the model's residual sum of squares to the total variance of the target variable. A model that always predicts the mean value of $y$ has $R^2=0$, whereas a model that explains all the variance in $y$ has $R^2=1$, in other words a perfect fit.

RMSE is defined as $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(f_i-y_i)^2}$.
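Minimal NumPy implementations of the two metrics above (scikit-learn's `r2_score` and `mean_squared_error` provide the same quantities):

```python
import numpy as np

def rmse(y: np.ndarray, f: np.ndarray) -> float:
    # Root mean squared error between actuals y and predictions f.
    return float(np.sqrt(np.mean((f - y) ** 2)))

def r_squared(y: np.ndarray, f: np.ndarray) -> float:
    # One minus residual sum of squares over total sum of squares.
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```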

Above we see the 5 largest positive and the 5 largest negative model coefficients. A coefficient (weight) can be interpreted as the change in the predicted target variable (crew size) for a 1-unit increase in the corresponding variable, holding the other variables fixed.
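The coefficient ranking can be produced as follows (a sketch, assuming the fitted `model` from above and a `feature_names` list matching the columns of the training matrix):

```python
import pandas as pd

coefs = pd.Series(model.coef_, index=feature_names).sort_values()
print(coefs.tail(5))   # 5 largest positive coefficients
print(coefs.head(5))   # 5 largest negative coefficients
```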

The above plots show that with a stronger L2 regularisation weight, the Ridge regression model performs better on the validation data, i.e. it generalises better. A smaller L2 weight causes the model to overfit the training data and generalise poorly to the validation data. The curves do not converge on a single best balance between bias and variance, but the plots show that a stronger L2 regularisation weight improves the generalisability of the model on the validation data (5-fold cross-validation).

The above scatter plots show predicted vs. actual values and the Pearson score for the train and test sets. We can see that the Ridge regression model has learnt a good set of parameters that predict well on both the train and test sets. A Pearson score of 1 is a perfect positive linear fit; we see Pearson scores of 0.97 and 0.95 for the train and test sets, respectively.
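A sketch of such a plot with matplotlib (the prediction variable names are assumptions):

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

for name, (yy, ff) in {"train": (y_train, pred_train),
                       "test": (y_test, pred_test)}.items():
    r = pearsonr(yy, ff)[0]
    plt.scatter(yy, ff, alpha=0.6, label=f"{name} (Pearson {r:.2f})")
plt.xlabel("actual crew size")
plt.ylabel("predicted crew size")
plt.legend()
plt.show()
```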

Above is a printout of the Pearson, $R^2$ and RMSE scores for the train and test sets. Performance is slightly worse on the test set than on the training set, but still very good: we have found a good model.