What are models in Data Science?
In a simple sentence, models are built so that we can make predictions about a trend we are investigating.
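There are two broad categories of models: supervised, which we train on data whose correct answers (labels) are already known, and unsupervised, which look for structure in data that has no labels.
In this article we deal with the first category, which we split here into 3 subcategories: regression, classification and decision trees (a decision tree can be used for either of the first two tasks).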
The most familiar models when we use regression are:
Linear, where we try to fit a straight line as close as possible to all the data points.
Polynomial, where, depending on its degree, we use additional parameters so that the curve gets closer to more points.
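To make the difference between the two concrete, here is a minimal sketch on made-up data (the numbers and the helper `fit_sse` are assumptions for illustration, not part of the car dataset used later). It fits a straight line and a parabola to the same curved data with NumPy and compares how far each one stays from the points.

```python
import numpy as np

# Synthetic, made-up data: a curved relationship plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

def fit_sse(degree):
    """Fit a polynomial of the given degree and return its sum of squared errors."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals**2))

sse_linear = fit_sse(1)     # straight line
sse_quadratic = fit_sse(2)  # parabola, the shape the data was generated from

print(f"linear SSE:    {sse_linear:.1f}")
print(f"quadratic SSE: {sse_quadratic:.1f}")
```

The parabola ends up much closer to the points because it has one extra parameter that can follow the curvature.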
The problem that arises
The difficulty is usually finding the right balance. The more closely the model fits the training points, the greater the chance that future points will deviate from it, and then we have overfitting.
Underfitting is when the model does not come close to most of the points; in that case we should change the model type or increase the polynomial degree.
Each model is first trained on a percentage of the data, and the remaining percentage is used for testing.
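The trade-off can be sketched on synthetic data (everything below is an assumption made for illustration: the sine-shaped data, the 70-30 split done by hand and the helper `r2`). The score on the training points can only go up as the degree grows, while the score on the held-out points is the one that tells us whether the model generalizes.

```python
import numpy as np

# Made-up data: a smooth curve plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# Hold out 30% of the points for testing.
idx = rng.permutation(x.size)
n_test = int(0.3 * x.size)
test_idx, train_idx = idx[:n_test], idx[n_test:]
x_tr, y_tr = x[train_idx], y[train_idx]
x_te, y_te = x[test_idx], y[test_idx]

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

train_scores, test_scores = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    train_scores[degree] = r2(y_tr, np.polyval(coeffs, x_tr))
    test_scores[degree] = r2(y_te, np.polyval(coeffs, x_te))
    print(degree, round(train_scores[degree], 3), round(test_scores[degree], 3))
```

A large gap between the training score and the test score at a high degree is the usual sign of overfitting; a low score on both is the sign of underfitting.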
How accurate the model is
Accuracy is judged by R^2 (how little the values deviate from the model line). Its maximum is 1 and it normally falls between 0 and 1, although on test data it can drop below 0 when the model does worse than simply predicting the mean.
It is also judged by RMSE (the square root of the mean of the squared differences between the predicted values and the actual values), which is zero or higher. Zero would mean a perfect model, which in practice is impossible.
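Both metrics are short enough to compute by hand. The sketch below uses four made-up values (not taken from the car data) and follows the standard definitions: RMSE is the square root of the mean squared error, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual values and two sets of predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.0, 7.5, 9.0])   # close to the truth
y_bad = np.array([9.0, 7.0, 5.0, 3.0])    # worse than predicting the mean

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(f"RMSE good: {rmse(y_true, y_good):.3f}")        # 0.354
print(f"R^2 good:  {r_squared(y_true, y_good):.3f}")   # 0.975
print(f"R^2 bad:   {r_squared(y_true, y_bad):.3f}")    # -3.000
```

The last line shows why R^2 is not strictly limited to the range 0 to 1: predictions that are worse than the plain average give a negative value.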
Detailed example
First we'll load all the libraries we might need so we don't get confused later.
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import seaborn as sns
from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge, RidgeClassifier
# jaccard_score replaces jaccard_similarity_score, which newer scikit-learn releases no longer provide
from sklearn.metrics import f1_score, jaccard_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
%matplotlib inline
#!conda install -c anaconda seaborn -y
We load a CSV with car details into a dataframe, keep the columns we want and add a column with the average fuel consumption (the mean of city and highway mpg).
df = pd.read_csv('https://gist.githubusercontent.com/smatzouranis/acd3354f30ecc1e7cb90caee84650c3a/raw/61adad1fca973303f4af8bc378b3a5432b7371e7/autos_csv.csv')
df = df[['make','fuel-type','horsepower','city-mpg','highway-mpg','price']]
df['AVG-mpg'] = (df['city-mpg']+df['highway-mpg'])/2
dff = df[['make','fuel-type','horsepower','AVG-mpg','price']]
dff.head()
Bar plot
Let's make a quick bar plot of the cost per brand.
It only needs 5 lines of code.
makers = dff['make']
prices = dff['price']
# Set the style before creating the figure so that it is actually applied
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(8, 8))
ax.barh(makers, prices)
Data preparation
Because the fuel type of each car is stored as text, we create new columns, one for diesel and one for gas, holding 0 or 1 depending on the fuel, using pd.get_dummies. When we then drop the original text column, the parameter inplace=True is needed for the change to be applied to the dataframe itself.
dff = pd.concat([dff, pd.get_dummies(dff['fuel-type'])], axis=1)
dff.drop(['fuel-type'], axis=1, inplace=True)
dff.head()
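To see what this encoding does on its own, here is a minimal sketch on a three-row toy dataframe (the rows are made up; only the column names follow the article). The `dtype=int` argument is my addition so that the new columns show 0 and 1 rather than True and False.

```python
import pandas as pd

# Toy data with the same column names as the article's dataframe.
cars = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "fuel-type": ["gas", "diesel", "gas"],
})

# One new 0/1 column per distinct fuel type.
dummies = pd.get_dummies(cars["fuel-type"], dtype=int)

# Replace the text column with the new indicator columns.
encoded = pd.concat([cars.drop(columns=["fuel-type"]), dummies], axis=1)
print(encoded)
```

Each row has a 1 in exactly one of the two new columns, so the model can use the fuel type as a number.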
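Now, for the cars that are missing a value (for example the price), we fill it in with the average of that column from the rest.
X = dff[dff.columns.difference(['price'])]
# numeric_only=True skips the text column 'make' when computing the averages
X = X.fillna(X.mean(numeric_only=True))
X.head()
y = dff[['price']]
y = y.fillna(y.mean())
y[0:5]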
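Filling with the column average is easy to check on a tiny example. The dataframe below is made up; `numeric_only=True` is used because, as far as I know, recent pandas versions raise an error when asked for the mean of a text column such as make.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing horsepower and one missing price.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "horsepower": [100.0, np.nan, 140.0],
    "price": [10000.0, 14000.0, np.nan],
})

# Column averages of the numeric columns only.
averages = toy.mean(numeric_only=True)

# Each missing value is replaced by the average of its own column.
filled = toy.fillna(averages)
print(filled)
```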
Linear regression
With the seaborn library, a single line of code gives us a quick linear regression plot that shows how the price rises with horsepower.
ax = sns.regplot(x='horsepower', y='price', data=dff)
We write 2 functions so that we can make the graphs quickly and easily.
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    # Overlay the density curves of two sets of values (e.g. actual vs predicted)
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    # kdeplot stands in for distplot(hist=False), which newer seaborn releases deprecate
    ax1 = sns.kdeplot(np.ravel(RedFunction), color="r", label=RedName)
    ax2 = sns.kdeplot(np.ravel(BlueFunction), color="b", label=BlueName, ax=ax1)
    plt.title(Title)
    plt.xlabel('Price in dollars')
    plt.ylabel('Density')
    plt.legend()
    plt.show()
    plt.close()

def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    # Plot the training points, the test points and the fitted polynomial curve
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])
    x = np.arange(xmin, xmax, 0.1)
    plt.plot(xtrain, y_train, 'ro', label='Training data')
    plt.plot(xtest, y_test, 'go', label='Test data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()
Split data into Train and Test
We divide the data into training and test sets with a 70-30 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("number of test samples :", X_test.shape[0])
print("number of training samples:", X_train.shape[0])
number of test samples : 62
number of training samples: 143
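The two counts can be reproduced without the car data. The sketch below splits 205 row numbers (143 + 62, the sample counts printed above) with the same test_size and random_state; the array `rows` is just a stand-in for the real dataframe.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the dataset: one number per row, 205 rows in total.
rows = np.arange(205)

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=0)

print("training rows:", len(train_rows))
print("test rows:", len(test_rows))

# Every row lands in exactly one of the two sets.
overlap = set(train_rows) & set(test_rows)
print("rows in both sets:", len(overlap))
```

Because 30% of 205 is not a whole number, scikit-learn rounds the test set up to 62 rows and leaves the remaining 143 for training.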
Grading the model
We store the class we will use in a variable and start the training with the data. The score then gives an R^2 of 0.33, which means that horsepower alone explains only about a third of the variation in price on the test data, so the line stays far from most of the points.
lre = LinearRegression()
lre.fit(X_train[['horsepower']], y_train)
lre.score(X_test[['horsepower']], y_test)
0.3331272902078515
Cross-validate
We can also score the model with cross-validation: the data is divided into 4 pieces (folds), and each piece is used in turn for testing while the other three are used for training.
Rcross = cross_val_score(lre, X[['horsepower']], y, cv=4)
print("The mean of the folds are", Rcross.mean(), "and the standard deviation is", Rcross.std())
The mean of the folds are 0.4392710840512933 and the standard deviation is 0.16681254993011282
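What cross_val_score does behind the scenes can be written out as an explicit loop. This is a sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up), assuming the default behaviour for a regressor with cv=4, which is a plain KFold with 4 splits and no shuffling.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Made-up data: price grows roughly linearly with horsepower.
rng = np.random.default_rng(0)
X_demo = rng.uniform(50, 250, size=(80, 1))
y_demo = 100.0 * X_demo[:, 0] + rng.normal(scale=3000.0, size=80)

# Manual version: train on 3 folds, score on the remaining one, 4 times.
manual_scores = []
for train_idx, test_idx in KFold(n_splits=4).split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

# Library version: the same thing in one call.
library_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)

print("manual :", np.round(manual_scores, 4))
print("library:", np.round(library_scores, 4))
```

The mean of the four scores is a steadier estimate than a single train/test split, and the standard deviation shows how much the result depends on which rows were held out.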
Price prediction
Let's try to build a model that predicts the price from the characteristics of the car.
lre = LinearRegression()
lre.fit(X_train[['horsepower', 'AVG-mpg', 'diesel', 'gas']], y_train)
lre.score(X_test[['horsepower', 'AVG-mpg', 'diesel', 'gas']], y_test)
0.40859095219326313
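We store the predicted values for the training and the test data as yhat_train and yhat_test.
yhat_train = lre.predict(X_train[['horsepower', 'AVG-mpg', 'diesel', 'gas']])
# yhat_test is needed further down for the test-data plot
yhat_test = lre.predict(X_test[['horsepower', 'AVG-mpg', 'diesel', 'gas']])
yhat_train[3:7]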
array([[21400.06711013],
[ 9861.84714747],
[ 4442.33687254],
[ 5784.42153314]])
With an R^2 of about 0.41, we see that our model still does not have good accuracy.
The accuracy of the model
Title = 'Train Data - Plot of predicted vs actual data'
DistributionPlot(y_train, yhat_train, "Actual", "Predicted", Title)
Title = 'Test Data - Plot of predicted vs actual data'
DistributionPlot(y_test, yhat_test, "Actual", "Predicted", Title)
Polynomial but what degree?
Let's move to a polynomial of degree 5 and see if we can get something better. This time we use only horsepower.
pr = PolynomialFeatures(degree=5)
X_train_pr = pr.fit_transform(X_train[['horsepower']])
# Reuse the transformer fitted on the training data instead of refitting it on the test set
X_test_pr = pr.transform(X_test[['horsepower']])
poly = LinearRegression()
poly.fit(X_train_pr, y_train)
# Predictions of the polynomial model on the test data
yhat = poly.predict(X_test_pr)
print("Predicted values:", yhat[0:4])
print("Actual values:", y_test[0:4].values)
Predicted values: [[ 5524.18191454]
[21532.75818567]
[14610.3150921 ]
[ -995.04541995]]
Actual values: [[ 6795.]
[15750.]
[15250.]
[ 5151.]]
PollyPlot(X_train[['horsepower']],X_test[['horsepower']],y_train,y_test,poly,pr)
We see an improvement.
poly.score(X_train_pr,y_train)
0.6830658437904327
poly.score(X_test_pr,y_test)
0.6830658437904327
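One way to avoid handling the transformed training and test arrays by hand is to chain the polynomial step and the regression in a scikit-learn pipeline. This is a sketch on synthetic data, not the article's method: the arrays `hp` and `price` and their numbers are made up, and degree 2 is chosen only because that is how the toy data was generated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: price grows with the square of horsepower.
rng = np.random.default_rng(1)
hp = rng.uniform(50, 250, size=(100, 1))
price = 5000 + 0.4 * hp[:, 0] ** 2 + rng.normal(scale=1500, size=100)

hp_train, hp_test, price_train, price_test = train_test_split(
    hp, price, test_size=0.3, random_state=0
)

# The pipeline fits the transformer on the training data only
# and applies the same transformation when predicting or scoring.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(hp_train, price_train)
pipeline_pred = poly_model.predict(hp_test)
print("test R^2:", round(poly_model.score(hp_test, price_test), 3))

# The same thing done step by step, for comparison.
pf = PolynomialFeatures(degree=2)
manual = LinearRegression().fit(pf.fit_transform(hp_train), price_train)
manual_pred = manual.predict(pf.transform(hp_test))
```

Both versions give the same predictions; the pipeline just makes it harder to transform the test data in a different way from the training data by mistake.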
We can make a loop that tests the process with different degrees to choose the best one.
Rsqu_test = []
order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)
    X_train_pr = pr.fit_transform(X_train[['horsepower']])
    # Transform the test data with the transformer fitted on the training data
    X_test_pr = pr.transform(X_test[['horsepower']])
    lre.fit(X_train_pr, y_train)
    Rsqu_test.append(lre.score(X_test_pr, y_test))
plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')
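A variation on this loop is to pick the degree with cross-validation, so that the test set is kept for the final check only. The sketch below is an assumption of mine, not the approach used in this article: it runs on made-up data (`hp`, `price`) and adds a StandardScaler step so that the high powers of horsepower stay numerically manageable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up data with a quadratic relationship.
rng = np.random.default_rng(2)
hp = rng.uniform(50, 250, size=(120, 1))
price = 5000 + 0.4 * hp[:, 0] ** 2 + rng.normal(scale=500, size=120)

# Mean cross-validated R^2 for each candidate degree.
cv_scores = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree),
        LinearRegression(),
    )
    cv_scores[degree] = cross_val_score(model, hp, price, cv=4).mean()
    print(degree, round(cv_scores[degree], 4))

best_degree = max(cv_scores, key=cv_scores.get)
print("best degree:", best_degree)
```

On real data the differences between degrees are often small, so it is usually reasonable to prefer the lowest degree that scores close to the best one.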
Ridge model
We also do a final test with the Ridge model, a linear regression that penalizes large coefficients (the strength of the penalty is set by alpha), to see if we get even better accuracy.
pr = PolynomialFeatures(degree=2)
X_train_pr = pr.fit_transform(X_train[['horsepower', 'AVG-mpg', 'diesel', 'gas']])
# Transform the test data with the transformer fitted on the training data
X_test_pr = pr.transform(X_test[['horsepower', 'AVG-mpg', 'diesel', 'gas']])
RigeModel = Ridge(alpha=0.01)
RigeModel.fit(X_train_pr, y_train)
yhat = RigeModel.predict(X_test_pr)
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)
predicted: [[ 6807.95323245]
[20468.43090602]
[14849.82259996]
[10178.31016628]]
test set : [[ 6795.]
[15750.]
[15250.]
[ 5151.]]
Rsqu_test = []
Rsqu_train = []
ALFA = 5000 * np.array(range(0, 2))
for alfa in ALFA:
    RigeModel = Ridge(alpha=alfa)
    RigeModel.fit(X_train_pr, y_train)
    Rsqu_test.append(RigeModel.score(X_test_pr, y_test))
    Rsqu_train.append(RigeModel.score(X_train_pr, y_train))
width = 12
height = 10
plt.figure(figsize=(width, height))
plt.plot(ALFA, Rsqu_test, label='validation data ')
plt.plot(ALFA, Rsqu_train, 'r', label='training Data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()
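To get a feeling for what alpha does, the sketch below fits Ridge on made-up data with several alpha values and measures the overall size of the coefficients. The data, the alpha values and the dictionary `coef_norms` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data: 5 features, two of which have no effect on the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
true_coef = np.array([3.0, -2.0, 0.0, 1.5, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=60)

# Overall size (Euclidean norm) of the fitted coefficients for each alpha.
coef_norms = {}
for alpha in (0.01, 1.0, 100.0, 10000.0):
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))
    print(alpha, round(coef_norms[alpha], 4))
```

The larger alpha becomes, the more the coefficients are pulled towards zero: a small alpha behaves almost like plain linear regression, while a very large one flattens the model and can lead to underfitting.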
After such a long (TL;DR) post, I think you will have got the general idea.
These techniques will certainly seem difficult and complex at first, but because the code required is only a few lines, with use and experience their daily use becomes easy.