Today I came across a post on Facebook that made me laugh a lot. You can enjoy it too in the image below.
The picture above shows clearly how badly the current run rate works as the only factor for predicting the final score in an ODI match. In ODIs, many factors play a key role in deciding what the final score will be. Let’s look at some of the key factors:
- Number of wickets left
- Number of balls left
- Runs scored so far by the two batsmen currently at the crease
- Runs scored by the team in the last 5 overs
- Wickets lost by the team in the last 5 overs
- The nature of the pitch
- The strength of the batting and bowling teams
I will use some of these factors to predict the score using machine learning algorithms. Because the final score is a number rather than a category, I use regression analysis to predict the final score of an ODI match.
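To make the problem with run rate concrete, here is a minimal sketch of a run-rate-only projection. It uses the match situation that is fed to the model later in this post (66 runs after 4.3 overs) and assumes a full 50-over innings. The helper names `overs_to_balls` and `naive_projection` are made up for this illustration.

```python
def overs_to_balls(overs: float) -> int:
    """Convert cricket overs notation (4.3 = 4 overs and 3 balls) to balls bowled."""
    whole = int(overs)
    balls = round((overs - whole) * 10)
    return whole * 6 + balls


def naive_projection(runs: int, overs: float, total_overs: int = 50) -> float:
    """Project the final score by assuming the current run rate never changes."""
    overs_bowled = overs_to_balls(overs) / 6
    run_rate = runs / overs_bowled
    return run_rate * total_overs


projected = naive_projection(66, 4.3)
print(round(projected))  # about 733, far beyond any realistic ODI total
```

The projection ignores wickets, the batsmen at the crease, and how far the innings has progressed, which is exactly why it breaks down this early in a match.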
Preparing the dataset
I scraped the commentary of 1134 matches played between 2007 and 2017, keeping only the matches that were not curtailed by rain. The data is stored in a CSV file, which can be found here. I then used some of the columns from this CSV file as my features. Most of the column names in the table are self-explanatory, but the last three can be confusing, so here is what they mean:
striker = max(striker_runs_scored,non_striker_runs_scored)
non_striker = min(striker_runs_scored,non_striker_runs_scored)
total = final score of batting team
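Here are the features I have used. The code that follows also selects the striker and non_striker columns described above, which gives nine feature columns in total.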
- Current runs scored
- Current wickets
- Current overs
- Current run rate
- Runs scored in last 5 overs
- Wickets fallen in last 5 overs
- Run rate in last 5 overs
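As a rough illustration of how features like these could be derived from ball-by-ball data, here is a minimal sketch. It is not the code used to build the CSV: the `build_features` helper, the `(runs, is_wicket)` input format, and the sample deliveries are all made up, extras such as wides and no-balls are ignored, and the column order is only assumed to match the prediction example later in this post.

```python
def build_features(balls, batsman_scores):
    """Build one feature row from the deliveries bowled so far.

    balls: list of (runs, is_wicket) tuples, one per legal delivery (hypothetical format).
    batsman_scores: individual scores of the two batsmen at the crease.
    """
    n = len(balls)
    runs = sum(r for r, _ in balls)
    wickets = sum(1 for _, w in balls if w)
    overs = n // 6 + (n % 6) / 10          # cricket notation: 4.3 = 4 overs, 3 balls
    run_rate = runs / (n / 6)

    last = balls[-30:]                     # last 5 overs = 30 balls, or fewer early on
    runs_last_5 = sum(r for r, _ in last)
    wickets_last_5 = sum(1 for _, w in last if w)
    run_rate_last_5 = runs_last_5 / (len(last) / 6)

    striker = max(batsman_scores)          # higher of the two individual scores
    non_striker = min(batsman_scores)      # lower of the two individual scores

    return [runs, wickets, overs, round(run_rate, 2),
            runs_last_5, wickets_last_5, round(run_rate_last_5, 2),
            striker, non_striker]


# Made-up deliveries adding up to 66 runs off 27 balls, with no wickets
sample_balls = [(4, False)] * 12 + [(2, False)] * 9 + [(0, False)] * 6
features = build_features(sample_balls, (36, 28))
print(features)
```

Note that before 5 overs have been bowled, the "last 5 overs" features simply cover the whole innings so far, which matches the numbers used in the prediction example.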
    import pandas as pd

    # Load the dataset
    dataset = pd.read_csv('odi.csv')
    # Columns 8-16 are the features; column 17 is the final total
    X = dataset.iloc[:, [8, 9, 10, 11, 12, 13, 14, 15, 16]].values
    y = dataset.iloc[:, 17].values
Splitting the dataset into the Training set and Test set
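    # train_test_split lives in sklearn.model_selection
    # (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)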
    from sklearn.preprocessing import StandardScaler

    # Feature scaling: fit on the training set only, then apply the same scaling to the test set
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
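For anyone wondering what the scaling step does, here is a minimal pure-Python sketch of standardization. The `fit_scaler` and `transform` helpers and the numbers are made up for illustration. It mirrors the idea behind StandardScaler (subtract the column mean, divide by the population standard deviation), but it is not scikit-learn's implementation.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute (mean, std) for every column, using the training rows only."""
    columns = list(zip(*rows))
    return [(fmean(col), pstdev(col)) for col in columns]


def transform(rows, params):
    """Standardize each value: z = (x - mean) / std (std of 0 is treated as 1)."""
    return [
        [(x - mean) / (std or 1.0) for x, (mean, std) in zip(row, params)]
        for row in rows
    ]


# Made-up rows: [current runs, current wickets]
train = [[100.0, 2.0], [150.0, 4.0], [200.0, 6.0]]
test = [[175.0, 3.0]]

params = fit_scaler(train)               # statistics come from the training set
train_scaled = transform(train, params)
test_scaled = transform(test, params)    # the test set reuses the training statistics
print(test_scaled)
```

Fitting on the training set and reusing those statistics on the test set keeps information about the test data from leaking into the model.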
Training using linear regression
    from sklearn.linear_model import LinearRegression

    # Fit an ordinary least squares model on the scaled training data
    lin = LinearRegression()
    lin.fit(X_train, y_train)
Predicting with the trained model and calculating the R-squared value
    y_pred = lin.predict(X_test)
    # score() returns R squared on the test set
    print(lin.score(X_test, y_test) * 100)  # 54% on my dataset
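The value returned by `score` is the coefficient of determination, R squared. Here is a minimal sketch of how it is calculated, using a hypothetical `r_squared` helper and made-up scores rather than my dataset.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up final scores and predictions
actual = [250, 300, 280, 220]
predicted = [260, 290, 270, 240]

print(r_squared(actual, predicted))      # roughly 0.81
print(r_squared(actual, [262.5] * 4))    # 0.0: no better than always predicting the mean
```

So an R squared of 54% means the model explains a bit more than half of the variation in final scores that a constant "predict the average" baseline leaves unexplained.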
A custom accuracy function
I have used a custom function to measure the accuracy on my test dataset. A prediction counts as correct if the absolute difference between it and the actual score is at most 20 runs. You can define your own custom accuracy method.
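    def custom_accuracy(y_test, y_pred):
        # Count the predictions that are within 20 runs of the actual final score
        right = 0
        l = len(y_pred)
        for i in range(0, l):
            if abs(y_pred[i] - y_test[i]) <= 20:
                right += 1
        print((right / l) * 100)

    custom_accuracy(y_test, y_pred)  # 44% on my dataset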
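If you want to experiment with your own accuracy measure, here is one possible variation as a sketch. The `accuracy_within` and `mean_absolute_error` helpers and the sample scores are made up for illustration. The first makes the 20-run tolerance a parameter and returns the value instead of printing it, and the second reports the average size of the error in runs.

```python
def accuracy_within(y_true, y_pred, tolerance=20):
    """Percentage of predictions within `tolerance` runs of the actual score."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tolerance)
    return 100 * hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between predicted and actual scores, in runs."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up final scores and predictions
actual = [250, 300, 280, 220]
predicted = [260, 290, 270, 240]

print(accuracy_within(actual, predicted))                # 100.0
print(accuracy_within(actual, predicted, tolerance=10))  # 75.0
print(mean_absolute_error(actual, predicted))            # 12.5
```

Trying a few tolerances shows how sensitive the accuracy figure is to the cutoff, while the mean absolute error does not depend on any cutoff at all.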
Predicting the final score for the picture posted above
    import numpy as np

    # Feature values for the match situation in the picture, scaled with the same scaler
    new_prediction = lin.predict(sc.transform(np.array([[66, 0, 4.3, 14.67, 66, 0, 14.67, 36, 28]])))
    print(new_prediction)  # My model predicted 375, which can be considered a good prediction
I am just a beginner in machine learning, and I would be very glad if you could suggest ways to improve the accuracy of the model. I also hope you will try to create your own models using the data and improve on mine. If you need more features in the dataset, do comment.
You can find the whole code and dataset at: Predicting first innings score in ODI matches