Mercedes-Benz Greener Manufacturing

Tank Mitesh · Published in Analytics Vidhya · 14 min read · Jan 23, 2021


Can you cut the time a Mercedes-Benz spends on the test bench?

Table of Contents:

  1. Business Problem
  2. Objective of the Project
  3. Data Description
  4. First Cut Approach
  5. Exploratory Data Analysis
  6. Existing Approach to the Problem
  7. Feature Engineering part 1
  8. Feature Engineering part 2
  9. Feature Engineering part 3
  10. Model Selection and Hyper-parameter Tuning
  11. Model comparison
  12. Kaggle Submission
  13. Future Work
  14. Reference

1. Business Problem :

1.1 Introduction :

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium car makers. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each and every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. But, optimizing the speed of their testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines.

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

1.2 Project Overview :

In the automobile industry, every vehicle that comes out of production passes through a testing department. Safe and reliable testing is a crucial part of the automobile manufacturing process.

Mercedes-Benz manufactures vehicles at a huge rate every day and sends them to the testing department, the final stage of production. Every possible vehicle combination must pass through a test bench to ensure it is robust enough to keep passengers safe and withstand daily use. More tests mean more time spent on the test stand, increasing costs for the company and generating carbon dioxide, a polluting greenhouse gas.

2. Objective of the Project :

The main objective of this project is to reduce the time each production vehicle spends on the test bench. This optimization directly decreases the carbon dioxide emissions associated with the testing procedure.

We will use the given dataset to build a robust machine learning model that predicts the testing time of a car.

The dataset and a full overview of the problem can be downloaded from the Kaggle competition page.

3. Data Description :

Variables with letters are categorical. Variables with 0/1 are binary values.

  • train.csv — the training set
  • test.csv — the test set; you must predict the ‘y’ variable for the ‘ID’s in this file
  • The ground truth is labeled ‘y’ and represents the time (in seconds) that the car took to pass testing for each car (row).
  • R² (Coefficient of Determination): the evaluation metric for this competition. It measures how much of the variance in the actual testing times is explained by the predictions; 1 means a perfect fit, and 0 means the model does no better than predicting the mean. Formally, R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares.
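For reference, a minimal sketch of computing this metric with scikit-learn (the values below are purely illustrative):

import numpy as np
from sklearn.metrics import r2_score

# Illustrative values; in practice these are the true and predicted test times
y_true = np.array([100.2, 88.5, 110.7, 95.3])
y_pred = np.array([98.9, 90.1, 108.2, 97.0])

# R^2 = 1 - (sum of squared residuals) / (total sum of squares)
print('R-squared:', r2_score(y_true, y_pred))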

4. First Cut Approach :

As ‘y’ is a continuous variable, this is a regression problem. As a starting point, I chose Decision Tree Regression as my benchmark model because it is one of the simplest regression algorithms available.

5. Exploratory Data Analysis :

“ Give me six hours to chop down a tree and I will spend the first four sharpening the axe. - Abraham Lincoln ”

Exploratory data analysis is a core concept in machine learning. Most data scientists spend around 80% of their time on it: understanding the data, how it is distributed, and how it behaves. The better we understand the data, the better we can decide which type of model to choose and which kind of feature engineering to apply.

5.1 Importing, Visualizing and Understanding Data:

# Check the shape of the train and test data
print('train data shape - ', train.shape)
print('test data shape - ', test.shape)

# Check for duplicate columns in the train and test data
print("Duplicate in trainData columns -", train.columns.duplicated().any())
print("Duplicate in testData columns -", test.columns.duplicated().any())

We have 4209 data points and 378 features in the training dataset. 4209 points and 377 features in the test dataset.

Train data contains 369 binary features, 1 class label and 8 categorical features. Test data contains 369 binary features and 8 categorical features.

From the checks above, there are no null values in the train or test data, and no duplicate columns or rows in either dataset.

5.2 Analysis of Y label :

Let’s plot histograms and the CDF of the y label to check how it is distributed and where the 50th and 95th percentiles lie. We will then apply different transformation methods to reduce skewness and kurtosis.

Histogram of Testing time of car :

Histogram of Testing time of car with Log-Transformation :

Histogram of Testing time of car with BoxCox-Transformation :

Histogram of Testing time of car with YeoJohnson-Transformation :

We can see from the above plots that the Y label is highly skewed and has a high kurtosis. We applied different transformations to reduce them: the log transform, the Box-Cox transform and the Yeo-Johnson transform.

The Yeo-Johnson transformation gave the best result, with a skewness of -0.015639 and a kurtosis of 0.397150, compared to the log and Box-Cox transformations.
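A minimal sketch of how these transformations and their skewness/kurtosis can be compared with scipy, assuming the training data is loaded as train:

import numpy as np
from scipy import stats

y = train['y'].values

# Candidate transformations of the target
y_log = np.log1p(y)                      # log transform
y_boxcox, _ = stats.boxcox(y)            # Box-Cox (requires strictly positive y)
y_yeojohnson, _ = stats.yeojohnson(y)    # Yeo-Johnson (scipy >= 1.2)

for name, values in [('original', y), ('log', y_log),
                     ('box-cox', y_boxcox), ('yeo-johnson', y_yeojohnson)]:
    print(f'{name:12s} skew={stats.skew(values):+.6f}  '
          f'kurtosis={stats.kurtosis(values):+.6f}')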

CDF of Testing time of car :

In the CDF plot of the y label, the red line marks the 95th percentile.

95% of cars have a testing time below 121 seconds; only the 5% of cars that take longer than 121 seconds are candidates for being outliers.

50% of cars are tested in under 100 seconds.
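These percentiles can be read directly from the target column, for example:

import numpy as np

# 50th and 95th percentiles of the testing time
p50, p95 = np.percentile(train['y'], [50, 95])
print(f'50th percentile: {p50:.1f} sec, 95th percentile: {p95:.1f} sec')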

5.3 Analysis of Binary features :

Our first step is to check the distribution of values in the binary features. We have 369 binary features in total, and we use a Random Forest to select the top 15 features that are most important for this task.
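A minimal sketch of this selection, assuming X_binary (an illustrative name) holds the 369 binary columns and y the target:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Fit a Random Forest on the binary features only
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_binary, y)

# Rank features by importance and keep the top 15
importances = pd.Series(rf.feature_importances_, index=X_binary.columns)
top15 = importances.sort_values(ascending=False).head(15)
print(top15)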

Feature importance of the Binary Features :

Top 15 important features

We see that X314 and X315 are the most important of all the binary features. We will use these top features to create synthetic features.

Count plot top important Binary Features :

Count-plot of top 15 features

Count plot of Binary Features :

We see that more than 75% of the binary features contain over 50% zero values. In other words, most features are dominated by zeros, and only a few are dominated by ones.

5.4 Analysis of categorical features :

Categorical features are important in these datasets. The train and test sets contain a total of 8 categorical features, with values such as ‘a’, ‘c’, ‘d’ and ‘ab’. We performed univariate and bivariate analysis on these categorical features.

Count plot of Categorical Features :

Count plots of Categorical features

We observed from the above count plots that some features, such as X4 and X2, take nearly constant values; these features will not improve model performance, so we will not use them to create new features.

We also performed bivariate analysis between the categorical features and the y label. Most features show a good relationship with y, except the “X4” feature. Some points lie above 150 seconds, outside our 95th percentile range, and can be treated as outliers.

Scatter plot of Categorical Features :

Box-plot of Categorical Features :

We observed that ‘X4’ has very low variance as compared to other categorical features. So, it is not useful for modelling.

EDA Conclusion :

The Y label was highly skewed, which can reduce model performance, so we transformed the label to reduce skewness and kurtosis.

We will use the top 15 binary features for feature engineering, and we will keep all categorical features except “X4”.

6. Existing Approach :

Here is how Dhilip approached the problem; his feature engineering and modelling were as follows:

Technique 1: Dimensional Reduction using PCA.

Technique 2: Analysis in features and adding new Interactive features.

Technique 3: Selecting Top Features in Technique 2 using SelectKBest.

For the categorical features, he used:

  1. Label Encoding
  2. Frequency Encoding
  3. Mean Encoding

He used tree-based algorithms such as Random Forest, XG-Boost and Extra Trees, along with a Stacking Regressor.
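For illustration, a minimal sketch of frequency and mean encoding for a single categorical column (‘X0’ used as an example):

# Frequency encoding: replace each category by how often it appears
freq_map = train['X0'].value_counts(normalize=True)
train['X0_freq'] = train['X0'].map(freq_map)

# Mean (target) encoding: replace each category by its mean testing time
mean_map = train.groupby('X0')['y'].mean()
train['X0_mean'] = train['X0'].map(mean_map)

# For the test set, map with the statistics computed on the training data
test['X0_freq'] = test['X0'].map(freq_map)
test['X0_mean'] = test['X0'].map(mean_map)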

7. Feature Engineering Part 1 :

In Feature Engineering Part 1, we apply label encoding to the categorical features.

Using PCA, we create 5 new synthetic features.

For the remaining features, we apply feature engineering methods such as difference-ratio encoding, quadratic encoding and cosine encoding.
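A minimal sketch of the label encoding and PCA step described above (assuming train_1 and test_1 are copies of the raw data and cat_cols holds the 8 categorical column names; both names are illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA

# Label-encode each categorical column (fit on train + test to cover all categories)
for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train_1[col], test_1[col]]))
    train_1[col] = le.transform(train_1[col])
    test_1[col] = le.transform(test_1[col])

# 5 synthetic PCA features from the encoded feature matrix (excluding ID and y)
feature_cols = [c for c in train_1.columns if c not in ('ID', 'y')]
pca = PCA(n_components=5, random_state=42)
train_pca = pca.fit_transform(train_1[feature_cols])
test_pca = pca.transform(test_1[feature_cols])

for i in range(5):
    train_1[f'pca_feature{i}'] = train_pca[:, i]
    test_1[f'pca_feature{i}'] = test_pca[:, i]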

7.1 Difference ratio Encoding :

def difference_ratio(data, col1, col2, col3, col4):
    ''' Combine the dataset's columns as (col1 + col2) / (col3 + col4 + 1)
    data -- dataset
    col1, col2, col3, col4 -- the dataset's column names
    '''
    array = (data[col1] + data[col2]) / (data[col3] + data[col4] + 1)
    return array

# Apply the function to different columns of the train data
train_1['X315_314_51_299'] = difference_ratio(train_1, 'X315', 'X314', 'X51', 'X299')
train_1['X299_300_301_271'] = difference_ratio(train_1, 'X299', 'X300', 'X301', 'X271')
train_1['X50_88_51_31'] = difference_ratio(train_1, 'X50', 'X88', 'X51', 'X31')
train_1['X46_263_119_261'] = difference_ratio(train_1, 'X46', 'X263', 'X118', 'X261')
train_1['X136_118_136_60'] = difference_ratio(train_1, 'X136', 'X118', 'X136', 'X60')

# Apply the function to different columns of the test data
test_1['X315_314_51_299'] = difference_ratio(test_1, 'X315', 'X314', 'X51', 'X299')
test_1['X299_300_301_271'] = difference_ratio(test_1, 'X299', 'X300', 'X301', 'X271')
test_1['X50_88_51_31'] = difference_ratio(test_1, 'X50', 'X88', 'X51', 'X31')
test_1['X46_263_119_261'] = difference_ratio(test_1, 'X46', 'X263', 'X118', 'X261')
test_1['X136_118_136_60'] = difference_ratio(test_1, 'X136', 'X118', 'X136', 'X60')

train_1['X136_118_136_60'].head()

7.2 Quadratic Encoding :

def quadratic_encode(data, col1, col2):
    ''' Apply a quadratic formula with fixed weights to a dataset column
    (note: only col1 is used in the formula)
    '''
    array = data[col1]**2 + 5 * data[col1] + 8
    return array

# Apply the function and create new train data features
train_1['qua_encode_1'] = quadratic_encode(train_1, 'pca_feature1', 'X50_88_51_31')
train_1['qua_encode_2'] = quadratic_encode(train_1, 'pca_feature3', 'X270')
train_1['qua_encode_3'] = quadratic_encode(train_1, 'pca_feature0', 'X300')
train_1['qua_encode_4'] = quadratic_encode(train_1, 'X50_88_51_31', 'X315_314_51_299')

# Apply the function and create new test data features
test_1['qua_encode_1'] = quadratic_encode(test_1, 'pca_feature1', 'X50_88_51_31')
test_1['qua_encode_2'] = quadratic_encode(test_1, 'pca_feature3', 'X270')
test_1['qua_encode_3'] = quadratic_encode(test_1, 'pca_feature0', 'X300')
test_1['qua_encode_4'] = quadratic_encode(test_1, 'X50_88_51_31', 'X315_314_51_299')

train_1[['qua_encode_3']].head()

7.3 Cos Encoding :

def cos_encode(data, col1, col2, col3, col4):
    ''' Combine four columns using shifted cosine terms and take their ratio '''
    aa = (data[col1] + 0.8) + np.cos(data[col2] + 0.5)
    bb = (data[col3] + 3.5) + np.cos(data[col4] + 5.0)
    array = aa / bb
    return array

# Apply the function and create new train data features
train_1['cos_encode_1'] = cos_encode(train_1, 'X315_314_51_299', 'X50_88_51_31', 'qua_encode_3', 'qua_encode_2')
train_1['cos_encode_2'] = cos_encode(train_1, 'X136_118_136_60', 'X315', 'qua_encode_1', 'qua_encode_4')
train_1['cos_encode_3'] = cos_encode(train_1, 'X300', 'X299_300_301_271', 'qua_encode_3', 'qua_encode_2')
train_1['cos_encode_4'] = cos_encode(train_1, 'X50_88_51_31', 'X315_314_51_299', 'X118', 'X46_263_119_261')

# Apply the function and create new test data features
test_1['cos_encode_1'] = cos_encode(test_1, 'X315_314_51_299', 'X50_88_51_31', 'qua_encode_3', 'qua_encode_2')
test_1['cos_encode_2'] = cos_encode(test_1, 'X136_118_136_60', 'X315', 'qua_encode_1', 'qua_encode_4')
test_1['cos_encode_3'] = cos_encode(test_1, 'X300', 'X299_300_301_271', 'qua_encode_3', 'qua_encode_2')
test_1['cos_encode_4'] = cos_encode(test_1, 'X50_88_51_31', 'X315_314_51_299', 'X118', 'X46_263_119_261')

train_1['cos_encode_4'].head()

7.4 Feature Importance with different algorithms :

We used different algorithms to compute feature importance. Our synthetic features performed well with LightGBM and Random Forest; the PCA and synthetic features worked well with Ada-Boost; and the binary and PCA features performed well with GBDT and XG-Boost.

We created a dataset that contains the PCA features + label-encoded features + synthetic features.

8. Feature Engineering Part 2 :

In this part of feature engineering, we remove the binary features whose variance is below 0.01.

We also remove categorical features with near-constant values, such as the ‘X4’ feature.

So we removed the binary features with variance below 0.01, dropped the “X4” categorical feature, and converted the categorical features into labels using label encoding.

We then applied the difference-ratio, cosine and quadratic encoding methods to create synthetic features, and used PCA to create the top 100 components.
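A minimal sketch of the low-variance filter described above, assuming binary_cols (an illustrative name) lists the binary columns:

# Drop binary features whose variance is below 0.01
low_var_cols = [col for col in binary_cols if train_1[col].var() < 0.01]
train_1 = train_1.drop(columns=low_var_cols)
test_1 = test_1.drop(columns=low_var_cols)
print(f'Removed {len(low_var_cols)} low-variance binary features')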

9. Feature Engineering Part 3:

In this part of feature engineering, we take the Feature Engineering Part 2 dataset and use the SelectKBest method to keep 250 features.
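A minimal sketch of this selection step, assuming x_fe2 and y are the Part 2 feature DataFrame and target (names illustrative; f_regression is used here as an assumed scoring function):

from sklearn.feature_selection import SelectKBest, f_regression

# Keep the 250 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_regression, k=250)
x_fe3 = selector.fit_transform(x_fe2, y)
selected_cols = x_fe2.columns[selector.get_support()]
print(x_fe3.shape, len(selected_cols))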

After featurization, we have 4 featured datasets on which we can try our model. These are as follows:

  1. Label features + PCA (5) features + Synthetic features
  2. Label features + PCA (5) features + Synthetic features + Y clip 150 sec
  3. Label features + PCA (100) features + Synthetic features + without low variance binary features + Y clip 150sec
  4. Top 250 features select by SelectKBest

10. Model Selection and Hyper-parameter Tuning :

We will use tree-based models because they are robust to outliers and perform well on high-dimensional data. We will use the algorithms below for this task.

  1. Decision Tree (Baseline Model)
  2. Random Forest
  3. XG Boost
  4. Ada Boost
  5. Light GBM
  6. GBM
  7. Stacking Regressor

We use the Decision Tree as the baseline model for this task, as it is the simplest regression model. Since we have 300+ features, Decision Tree regression will give us a decent baseline score for further reference and improvement.
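The tuning snippets below call a helper function, kfold_grid_search, whose definition is not shown in this post. A minimal sketch of what such a helper could look like (a reconstruction under assumptions, not the exact implementation):

from sklearn.model_selection import KFold, GridSearchCV, RandomizedSearchCV

def kfold_grid_search(clf, params, x, y, fold=10, kfold=15, search='random'):
    """K-fold hyper-parameter search; prints and returns the best estimator."""
    cv = KFold(n_splits=fold, shuffle=True, random_state=42)
    if search == 'random':
        # 'kfold' is interpreted here as the number of random-search iterations (an assumption)
        searcher = RandomizedSearchCV(clf, params, n_iter=kfold, cv=cv,
                                      scoring='r2', n_jobs=-1)
    else:
        searcher = GridSearchCV(clf, params, cv=cv, scoring='r2', n_jobs=-1)
    searcher.fit(x, y)
    print('Best R2 score:', searcher.best_score_)
    print('Best params  :', searcher.best_params_)
    return searcher.best_estimator_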

10.1 Decision Tree :

from sklearn.tree import DecisionTreeRegressor

clf = DecisionTreeRegressor(criterion='mse',
                            max_depth=3,
                            max_features='auto',
                            min_samples_leaf=1,
                            min_samples_split=2,
                            splitter='best')

params = {'max_depth': [2, 3, 4, 8, 10, 15],
          'max_features': ['auto', 'sqrt', 'log2'],
          'random_state': [5, 10, 20, 30]}

kfold_grid_search(clf, params, x2, y2, 10, kfold=15, search='random')

10.2 Random Forest :

from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(bootstrap=True,
                            criterion='mse',
                            max_depth=5,
                            max_features=0.95,
                            min_impurity_decrease=0.001,
                            min_samples_leaf=2,
                            min_samples_split=8,
                            min_weight_fraction_leaf=0.0,
                            n_estimators=70)

params = {'n_estimators': [40, 50, 60, 70, 100],
          'max_depth': [3, 5, 6, 7, 8],
          'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9, 10],
          'max_features': [0.80, 0.95, 1.0],
          'min_samples_leaf': [1, 2, 3, 4, 5, 6, 7, 8, 9],
          'min_impurity_decrease': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0, 1, 10]}

kfold_grid_search(clf, params, x2, y2, fold=10, kfold=15, search='random')

10.3 XG-Boost :

from xgboost import XGBRFRegressor

clf = XGBRFRegressor(colsample_bylevel=1,
                     colsample_bynode=0.8,
                     colsample_bytree=1,
                     learning_rate=1,
                     max_depth=5,
                     min_child_weight=1,
                     n_estimators=100,
                     n_jobs=1,
                     objective='reg:linear',
                     reg_lambda=1,
                     scale_pos_weight=1,
                     subsample=0.8,
                     verbosity=1)

xparams = {'learning_rate': [0.1, 0.5, 0.8, 1],
           'n_estimators': [70, 80, 100],
           'max_depth': [2, 3, 4],
           'colsample_bytree': [0.1, 0.5, 0.7, 0.9, 1],
           'subsample': [0.2, 0.3, 0.5, 1],
           'gamma': [0.0001, 0.001, 0, 0.1, 0.01, 0.5, 1],
           'reg_alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1]}

kfold_grid_search(clf, xparams, x2, y2, fold=10, kfold=15, search=False)

10.4 Ada-Boost :

from sklearn.ensemble import AdaBoostRegressor

clf = AdaBoostRegressor(learning_rate=0.0001,
                        n_estimators=300,
                        loss='linear')

params = {'n_estimators': [100, 150, 200],
          'learning_rate': [0.0001, 0.001, 0.01, 0.1],
          'loss': ['linear', 'square', 'exponential'],
          'random_state': [10, 20, 30]}

kfold_grid_search(clf, params, x2, y2, fold=10, kfold=15, search='random')

10.5 Gradient Boosting :

from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(alpha=0.9,
                                criterion='friedman_mse',
                                learning_rate=0.01,
                                loss='huber',
                                max_depth=3,
                                min_samples_leaf=1,
                                min_samples_split=2,
                                n_estimators=800,
                                n_iter_no_change=11,
                                random_state=10,
                                subsample=1.0,
                                tol=0.0001,
                                validation_fraction=0.1,
                                verbose=0,
                                warm_start=False)

params = {'n_estimators': [500, 800, 1000, 1500, 2000],
          'loss': ['huber', 'ls'],  # 'exponential' is not a valid loss for the regressor
          'learning_rate': [0.01, 0.1],
          'max_depth': [3, 4, 5, 7]}

kfold_grid_search(clf, params, x2, y2, fold=10, kfold=15, search='random')

10.6 Light GBM :

from lightgbm import LGBMRegressor

clf = LGBMRegressor(boosting_type='gbdt',
                    colsample_bytree=1.0,
                    importance_type='split',
                    learning_rate=0.01,
                    max_depth=5,
                    min_child_samples=50,
                    min_child_weight=0.001,
                    min_split_gain=0.0,
                    n_estimators=1000,
                    n_jobs=-1,
                    num_leaves=5,
                    reg_alpha=0.0,
                    reg_lambda=0.0,
                    subsample=1.0,
                    subsample_for_bin=200000)

params = {'min_child_samples': [10, 20, 50],
          'num_leaves': [5, 6],
          'max_depth': [2, 3, 5],
          'n_estimators': [1000, 2000, 4000, 5000],
          'learning_rate': [0.0001, 0.001, 0.01, 0.1]}

kfold_grid_search(clf, params, x2, y2.ravel(), fold=10, kfold=15, search='random')

10.7 Stacking Regressor :

We use XG-Boost, LightGBM and Random Forest as base learners in the Stacking Regressor, with a Ridge Regressor as the final estimator.

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Base learners: the tuned XG-Boost, LightGBM and Random Forest models from above (names illustrative)
estimators = [('xgb', xgb_clf), ('lgbm', lgbm_clf), ('rf', rf_clf)]

stack = StackingRegressor(estimators=estimators, final_estimator=Ridge())
cv_score = cross_val_score(stack, x2, y2.ravel(),
                           scoring='r2', cv=5,
                           verbose=5, n_jobs=-1)
print('Mean Score:', cv_score.mean())
print('Standard Deviation:', cv_score.std())

11. Model Comparison :

Dataset 1 contains 5 PCA features, all binary features, label-encoded features and synthetic features.

Dataset 2 contains 5 PCA features, all binary features, label-encoded features and synthetic features, with y values above 150 seconds clipped at 150 seconds.

Dataset 3 contains 100 PCA features, label-encoded features and synthetic features, with the binary features of variance below 0.01 removed and y values above 150 seconds clipped at 150 seconds.

Dataset 4 was created with feature selection, using the SelectKBest method to select 250 features.

The best score, 0.54951, came from the Stacking Regressor, which surpasses our Random Forest Regressor result.

12. Kaggle Submission :

In this competition, the private leaderboard is calculated on approximately 81% of the test data, and the public leaderboard on the remaining 19%.
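For completeness, a minimal sketch of writing the submission file in the competition's ID/y format (final_model and x_test are illustrative names for the fitted model and the engineered test features):

import pandas as pd

# Predict the test-bench time for each test ID and write the submission file
preds = final_model.predict(x_test)
submission = pd.DataFrame({'ID': test['ID'], 'y': preds})
submission.to_csv('submission.csv', index=False)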

My submission score :

13. Future Work :

  1. Implement a neural network with properly chosen layers and hyper-parameters, which should be able to achieve a better score.
  2. Implement the Super Learner ensembling technique to further improve the score.

14. Reference :

  1. https://ieeexplore.ieee.org/document/7506650
  2. https://www.appliedaicourse.com/
  3. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion
  4. https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36390
  5. https://medium.com/analytics-vidhya/mercedes-benz-greener-manufacturing-kaggle-competition-1c25c89e012

15. Profile :

You can find me on LinkedIn and GitHub.
