How to predict cricket scores [Excel + Machine Learning] - 13 minutes read


How to predict cricket scores [Excel + Machine Learning] Chandoo.org

Can we predict cricket match score in Excel? Using machine learning, ensemble modeling, multiple regression and Excel formulas we can. This tutorial explains how.

Cricket world cup is on. Both my homes (India & New Zealand) have done well so far in the tournament and if things go OK in the last couple of matches, they should qualify for semi-finals. The games are happening in UK, which is 12 hours behind New Zealand. You know that means?

Yes, lots of sleepless nights in Wellington.

As I watch these games, I notice that every once in a while they show a “score predictor“. It will tell you what the final score could be based on the proceedings of the game so far.

So I thought, hmm, May be I should make one of those in Excel?

That is what I did. I created a machine learning model in Excel to predict cricket score. Sounds interesting? Read on.

If you are thinking “the only cricket that keeps me up all night is the damned chirping one in my basement”, then don’t worry.You need very little knowledge of cricket to understand the techniques. Once you know the ideas you can apply them to many other real life problems like predicting sales next year or student absences next term or electricity usage in the new plant.

Just a quick note if you are not familiar with cricket. In a typical one-day match, two sides compete. The game starts with one of teams batting first and scoring some runs in 50 overs. The next team then tries to beat that target set by first team in 50 overs.

Let’s start by defining the problem. We want to create a cricket score predictor that takes the inputs:

Our predictor should tell us what could be the final score at end of 50 overs.

For example, we could ask, “Australia scored 52 runs in 10 overs losing 1 wicket. What would be the final score?” and our predictor can provide a guess – say 305 runs.

The final score of a team in a cricket match depends on many things, including:

If we try to incorporate every little thing that matters, we will never be able to construct our prediction model. So let’s ignore everything except those 4 parameters (C, Rs, Wr, Or) listed above.

Let’s say Rp denotes our predicted Runs. We can define Rp as

But we know that Rp = Rs + something

This is because total runs at the end of 50 overs will be somethingmore than Runs Scored at the time of prediction.

If we can find somethingour problem is solved. But how?

Run rate is ratio between runs scored and overs completed. So if India scores 342 runs in 50 overs, their run rate is 6.84. As our prediction model is for 50 overs, if we know the Run Rate, we will know final score.

Given these variables, we can rewrite Rp (Runs predicted) as

So if we can build a model to predict RRr, we can calculate Predicted score.

We can argue that Run Rate in remaining overs will be a function of (country, run rate so far, overs remaining, wickets remaining)

We can further argue that each country has specific strengths and abilities when it comes to batting. So, if we define a set of functions, f1(), f2()…, fn() where fn is

We can then call the relevant function based on which country we are predicting the score for.

Regression of course. There are many sophisticated machine learning algorithms. But for something straight forward like Run Rate (remaining overs), we can create a simple multiple regression model.

Say RRr can be written as

Given a set of training data with RRs, Wr, Or and RRr, we can use LINEST() function in Excel to calculate {m1, m2, m3, const} that fits the sample data. Once we have the multipliers and constant value for each country, we can predict the score for any situation. Its that simple.

As with everything else in life, cricket matches too have significant variability. For that reason, rather than one regression model per country, why not create 10 of them per country and the average the prediction?

If this sounds like a bunch of bs, don’t worry. This is an actual machine learning technique known as Ensemble Modeling.

The idea of ensemble modeling is simple. We build multiple models from the training data and then combine the results of all models when making predictions.

See this picture to understand how a typical Ensemble Model works.

This way, we can create more variation in input scenarios and create a robust model.

For the sake of simplicity, let’s say we want to build 10 regression models for each country.

How to aggregate the ensemble model results?

We will end up with 10 predicted Run Rates (Remaining overs). We can simply average these 10 to come up with final prediction. You can also assign weights to the models and do a weighted average. Let’s stick with simple average.

That’s it. But if you want more text, here we go. Almost all machine learning models follow this pattern. They look at some data to construct the model. Once that model is ready, we then test it on a differentdata set to see how satisfactorily it performs. If the results are not up to scratch, we back to step one and fine tune the model.

The easiest of all these is a regression model. That is the same one we will be using too.

There are many fine websites for finding current score or recent match results. But in order to train our model, we need a collection of historical match data by each over. This is notoriously hard to get. Thankfully, there is cricksheet. They have historical one day match data at ball by ball level for 1,400 + matches in CSV format. (here is the downloads page)

Note about data: I noticed that cricsheet doesn’t have all matches data. For example I could not find any 2018 games in the data set I downloaded few days ago. It doesn’t really matter as we are using a large sample of data spanning several years.

As you can see, this data is at a too detailed level than what we need.

So I used Power Query to combine 1,400 files, reshape the data to over by over scores and then calculate total score, overs and wickets at every 5 over interval until end of the game. I then took only the recent 300 games as very old performances have little impact on current scoring patterns.

Sorry for not explaining the Power Query or Excel formula steps. That would get too technical and this post will never end.

Once reshaped, my data looks like this:

We can then derive additional columns,

Using 2,3,4 & 6 we can create our regression models.

But before we go there, let’s split the data in to training & test data sets. For this example, I choose the latest 50 games as test data set and everything older as training data. Instead of creating two separate tables, I just added a column at the end to look at [Match ID] to tell me whether something is test or training.

Also, we do not need to use last over data for training. At the end of game there is nothing left to predict, so there is no point of using last over data when training the model.

I have added a data point ID as column to this table so I can uniquely identify all data points when sampling training data. It is [Data point number]

Don’t freak out. We are still on topic. Bagging is the technical term for the concept of randomly sampling data, building models and then aggregating (ie bagging) at the end.

Our bootstrap approach is rather simple and naive.

A note on const: When I was building my cricket score prediction models, I realized that setting const=0 gave me a better R2 (ie the model fits well with training data). So I set the 3rd parameter of LINEST() to FALSE (ie no need for const). You may want to keep it on for other types of models.

In my training data from cricsheet, we have 16 countries. That means we need 16*10*50 = 8,000 data points to construct the models.

Using a bunch of RANDBETWEEN, INDEX+MATCH and COUNTIFS, I was able to construct this grid.

Once data grid is ready, we can create a bunch of LINEST() formulas to tell us the multipliers (m1, m2, m3) for each model. This can be done 160 times (each of the 16 countries need 10 models). But I am very lazy. So I used INDEX() formula to fetch arrays of 50 cells so that LINEST results can be tabulated nicely. This is how our multiple regression model looks:

As you can see, our model has very high R2 values. This is promising.

While high R2 values are good,you should not trust the model blindly. You should also check if the relationship between output (Run Rate in remaining overs) and inputs (RRs, Or, Wr) is chance. This can be done by looking at F statistic and F distribution probability. I have not bothered with this step for all of the data, but I did check for few samples to see if the F probability is low (low means relationship is not random).

Learn more about F statistic and how to interpret the results.

As you can see, each model predicts Run Rate (in remaining overs). But we need to predict the score. Given the inputs:

We can calculate predicted Runs (Rp) as

In simple words, our final prediction is Runs already scored + average of 10 predicted run rates timesremaining overs.

Now that we have our shiny ensemble models, let’s go test them. I have extracted score data from last 50 games by innings. I then filtered away any games with less than 50 overs played (canceled due to rain, chased before the last ball etc.)

This is what we have.

For prediction, we also need to know what were the runs scored and how many wickets they had in hand at certain over. I started by creating a scrollbar to select the over (any multiple of 5 between 5 and 40). Then we fetch the relevant inputs from test data and run the model against them to calculate predicted score. I then compared this against actual score to see what kind of error and accuracy our model is getting.

This involved using some crazy, but fun MMULT and INDEX functions (ofcourse, TRANSPOSE too). It is 2:19 AM as I am typing this. That means, Unfortunately, it is too late in the game to explain the formula logic here, so I will leave it to your imagination.

Here is how our model compares with actual results at 15 overs.

And this is how it works after 25, 35 and 40 overs. As you can see, accuracy improves the later in game you ask for prediction.

Right now, as I am typing this, India is playing against Bangladesh. India have score 314 in 50 overs. I wanted to see how our model predicts the score at various points in game. As you can see, it gets a little optimistic (as India didn’t loose a wicket until 30th over) but the prediction gets closer since 35th over.

If you want to examine the calculations, predict your own scores or just want to see how its all done, here is the file.

Feel free to mash up the data to create your own prediction tool.

This is an interesting topic and I am sure you want to know more. See below references to understand the concepts better.

I had so much fun creating this. I did have a few false starts and made models with wrong equations, but eventually came up with something that provides sensible prediction. I am happy with the way it turned out. Although I couldn’t explain every little thing about the model in this post, I hope you are able to fill those gaps in.

Do you like this prediction model in Excel? Are you surprised to see a complex machine learning algorithm implemented in good ol’ spreadsheet? Share your thoughts in the comments.

Source: Chandoo.org

Powered by NewsAPI.org

Keywords:

CricketMicrosoft ExcelCricketMicrosoft ExcelEnsemble learningStatistical modelRegression analysisMicrosoft ExcelCricket World CupIndia national cricket teamNew ZealandNew ZealandSleepless Nights (Lindisfarne album)WellingtonMachine learningCricketCricket (insect)Up All Night (Blink-182 song)KnowledgeCricket (insect)IdeaRealityCricketOver (cricket)Over (cricket)CricketOver (cricket)Australia national cricket teamRun (cricket)Over (cricket)WicketRun (cricket)CricketEvery Little Thing (band)PredictionConceptual modelPredictionRun (baseball)Run rateOver (cricket)India national cricket teamOver (cricket)Run rateOver (cricket)3RRRRun rateOver (cricket)Run rateFunction (mathematics)Regression analysisMachine learningRegression analysisRegression analysisFunction (mathematics)Money supplyDataCricket (insect)ReasonRegression analysisArithmetic meanMachine learningMathematical modelEnsemble learningMathematical modelGraph (discrete mathematics)VarianceRegression analysisWeighted arithmetic meanString (computer science)Machine learningScientific modellingDataScientific modellingStatistical hypothesis testingRegression analysisDataComma-separated valuesData setDoesn't Really MatterDataTimeDatabaseDataData setSeparate TablesSampling (statistics)Freak Out!ConceptRandomnessSampling (statistics)DataMathematical modelIndustrial engineeringBootstrapping (statistics)AlgorithmCricket (insect)PredictionMathematical modelCoefficient of determinationMathematical modelStatistical parameterStatistical modelData gridRegression analysisRegression analysisCoefficient of determinationValue (ethics)Coefficient of determinationValue (ethics)Value (ethics)Conceptual modelCorrelation and dependenceInformationRandomnessF-testF-distributionProbabilityDataSample (statistics)ProbabilityCorrelation and dependenceRandomnessF-testOver (cricket)Over (cricket)ScrollbarMathematical modelErrors and residualsAccuracy and precisionMathematical modelFunction (mathematics)TransposeMathematical logicIndia national cricket teamBangladeshIndia national cricket teamIndia national cricket teamMashup (music)ConceptPredictionEvery Little Thing She Does Is MagicDo You Like ThisPredictionConceptual modelMicrosoft ExcelMachine learningSpreadsheet