Applications of Generative AI in Sport: Q2 Update (Part Two)

In part two of our latest AI Trends in Sport feature, Chief Scientist Patrick Lucey outlines how Opta Vision solves a key challenge which has held back soccer analysis for the past 25 years, using a combination of computer vision and generative AI.

If you missed part one last week, you can find it here.

The key challenge of capturing player location and movement data from video in soccer is that remote video doesn’t provide a uniform perspective of the match.

To track a match from remote video, just one camera angle is used: the main game-camera view, normally situated at the half-way line at a reasonably high angle. This is the only angle used because it contains the information needed to calibrate the camera, such as the side-lines, centre-circle and 18-yard box. Other views do not contain such information, making camera calibration virtually impossible.
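
To make the calibration step concrete, here is a minimal sketch, not Stats Perform's implementation, of one standard way to calibrate a camera against a flat pitch: estimating a homography from pitch landmarks with the direct linear transform. The pixel coordinates are made up for illustration, and the pitch is assumed to be 105 x 68 m with its origin at one corner.

```python
import numpy as np

def estimate_homography(pixel_pts, pitch_pts):
    """Estimate the 3x3 homography mapping pixel (u, v) to pitch (x, y) via DLT."""
    pixel_pts = np.asarray(pixel_pts, dtype=float)
    pitch_pts = np.asarray(pitch_pts, dtype=float)
    if len(pixel_pts) < 4 or len(pixel_pts) != len(pitch_pts):
        raise ValueError("need at least four matching landmark pairs")
    rows = []
    for (u, v), (x, y) in zip(pixel_pts, pitch_pts):
        # Each landmark contributes two linear constraints on the entries of H.
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1].reshape(3, 3)

def pixel_to_pitch(h, u, v):
    """Project one pixel onto the pitch plane using homography h."""
    x, y, w = h @ np.array([u, v, 1.0])
    return x / w, y / w

# Made-up pixel positions of four landmarks in one video frame.
pixel_pts = [(960, 250), (960, 900), (200, 380), (60, 700)]
# The same landmarks in pitch coordinates (metres): both ends of the
# half-way line and two corners of the left 18-yard box.
pitch_pts = [(52.5, 0.0), (52.5, 68.0), (16.5, 13.84), (16.5, 54.16)]

H = estimate_homography(pixel_pts, pitch_pts)
player_on_pitch = pixel_to_pitch(H, 500, 500)
```

A view with no recognisable pitch markings gives no landmark pairs to feed into such a fit, which is why close-ups cannot be calibrated this way.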

However, even using the high-angle game-camera view, on average only 11 of the 22 players are in view. And there are often close-ups and replays – periods where previously no player tracking data could be captured.

The amount of time that replays/close-ups are used varies from game to game; some games have minimal close-ups, and then some have a lot – as much as 20% of the game.

Clearly, there are significant limitations for meaningful analysis of a team game like soccer if 20% of the game events and 50% of the off-ball runs that players make are not captured.

Take a look at these two examples. The first (top) depicts when 11 of the 22 players are out of view, and the second example (bottom) shows when all players are out of view, due to a close-up.

These two examples are taken from the same segment of play. Firstly we have the game-camera for a period of time, missing half the players from each team. Then we have a close-up for about 8 seconds, missing 20 players. The close-up contains three passes before a through-ball is played down the right-hand side of the pitch.

Using standard remote tracking, which does not capture tracking data during close-ups, we would miss the position and movement of most of the players and, possibly more critically, these three passes – most importantly the last pass, which leads to an attacking play.

Key passes are rare and highly important. Missing key passes, and the passes that lead to the key pass, and the influence and decisions of other players, therefore leaves a big gap in analysis.

Filling that gap with complete tracking data would make complete analysis possible. But how?

Enter Generative AI-powered Opta Vision

Human experts are pretty good at estimating what is happening when they can’t see things in sport, based on what they’ve seen in the past and knowledge of how different teams, players and coaches play in different situations. The question is, how can we get a computer to learn this and ‘impute’ the missing detail?

As previous articles in this series have explained, Generative AI models trained on text can correct an incorrect sentence or fill in a missing word. Models trained on images can use fill and expand (in-painting and out-painting) to complete an image. Multimodal models trained on text, images and videos, like OpenAI’s text-to-video technology “Sora”, can generate a complete video from just a textual description.

For soccer, the language we have created utilizes both our event data (i.e., what happened on the ball and who was involved) as well as our tracking data (player location and motion). In a similar way that Sora learned the mapping between text and video, Stats Perform learned the mapping between events and tracking data – which enables us to solve this problem.

By having the remote tracking data before and after the on-ball event, and then having the information about which on-ball events / actions occur and through which players, our model (which is trained on an enormous amount of our proprietary Opta data) has enough context to accurately estimate (or ‘impute’) where these players are. See our results below – in my view, it is magic!
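
The model itself is proprietary and is not reproduced here. Purely as a point of contrast, the sketch below shows a naive baseline that uses the same ingredients (tracking before and after the gap, plus on-ball event locations inside it): piecewise-linear interpolation through the known points. The function name, timestamps and coordinates are illustrative assumptions.

```python
import numpy as np

def impute_track(query_times, known_times, known_xy):
    """Fill in one player's (x, y) positions at query_times by interpolating
    linearly between known points. known_times must be increasing."""
    known_xy = np.asarray(known_xy, dtype=float)
    x = np.interp(query_times, known_times, known_xy[:, 0])
    y = np.interp(query_times, known_times, known_xy[:, 1])
    return np.column_stack([x, y])

# Known points for one player (all values made up):
#   t=0 s  last tracked position before the close-up
#   t=4 s  location of a pass by this player, taken from event data
#   t=8 s  first tracked position after the close-up
known_times = [0.0, 4.0, 8.0]
known_xy = [(40.0, 30.0), (50.0, 20.0), (60.0, 24.0)]

query_times = np.arange(0.0, 9.0, 2.0)   # 0, 2, 4, 6, 8 seconds
imputed = impute_track(query_times, known_times, known_xy)
```

Straight lines like these say nothing about players who never touch the ball during the gap, which is where a model trained on how teams and players actually move has to do the work.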

For this work, we were inspired by recent work in the autonomous vehicles domain, which does something similar: it uses maps as the “text equivalent” and then uses computer vision to map objects in the world into this “map space”.

Because we are dealing with visual data, diffusion models are the preferred choice: they excel at capturing fine details and producing high-quality outputs in visual tasks such as image or trajectory generation. For sequential data such as text (e.g., the tasks handled by ChatGPT and Gemini), transformer neural networks are better suited. While diffusion is a different approach from transformers, it still falls under the umbrella of generative AI because it can create new, realistic-looking images (or, in this case, realistic trajectories of missing players).
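
For readers unfamiliar with diffusion, the sketch below shows only the forward (noising) half of a standard denoising diffusion model applied to a player trajectory; a trained network would learn to reverse it. The article does not say which diffusion formulation is used, so the linear noise schedule, the number of steps and all names here are assumptions.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Sample a noisy version of trajectory x0 at diffusion step t.

    Uses the closed form x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps,
    where a_bar_t is the cumulative product of (1 - beta).
    """
    alpha_bar = np.cumprod(1.0 - betas)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Assumed schedule: 1000 steps with linearly increasing noise.
betas = np.linspace(1e-4, 0.02, 1000)
rng = np.random.default_rng(0)

# A made-up 50-frame run along the pitch: x goes from 40 m to 60 m, y stays at 30 m.
trajectory = np.column_stack([np.linspace(40.0, 60.0, 50), np.full(50, 30.0)])
noisy, eps = forward_noise(trajectory, t=999, betas=betas, rng=rng)
```

In practice the coordinates would be normalised before noising, and generation would run the learned reverse process conditioned on the context, such as the observed tracking and events.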

As mentioned, the results are quite “magical”. But more importantly, this solves a key problem in soccer as now all passes can be analyzed in the context of the location and movement of other players – something we refer to as “complete analysis”.

So we can do the same type of analysis from remote video that we could do with in-venue tracking, which is a massive paradigm shift in unlocking insights from more players, teams and leagues.

It also enables us to create complete data from past games. As we progress on this journey, you will hear more from us on this. But we showcased this recently at the MIT Sloan Sports Analytics Conference where Harry Hughes, from the Stats Perform AI team, did an amazing job presenting this work – see here for full details, together with a link to a video of the presentation.

Why can’t a CV system track during close-ups?

As you can see in the example on the bottom-left, we can see the players clearly (i.e., white jerseys), so detecting these players via a CV system is quite easy.

However, as this is at ground-level, it is virtually impossible to estimate where those players are in “pixel space” (i.e., the images) in relation to the rest of the players and the pitch. That kind of reasoning for positional and movement detection is much easier to do in “tracking space” (i.e., the top-down pitch view).

A leading figure in the AI space, Yann LeCun, recently mentioned that modelling the world in ‘pixel space’ is inefficient and impossible to solve. We agree, and that insight is the key to solving this challenge of generating complete tracking data from remote video. Our approach to generating tracking data essentially treats the ‘tracking data space’ as a 1,000,000:1 compression from the pixel space.

The beauty of operating within the tracking data space is that it also “binds us to the real-world”, as it limits the possibilities to the pitch dimensions (105 x 68 m on average in soccer), and the additional context of the events constrains it even more.

Why stop at player tracking data? Could CV systems detect “event data” straight from video?

First of all, let’s define what “event data” is. Using soccer as an example, event data refers to the actions that players perform within the game and the decisions officials make. They include free-kicks, goal kicks, corners, throw-ins, touches, passes, dribbles, shots, goals, own-goals, saves, headers, tackles, interceptions, fouls, penalties, yellow cards, red cards etc.

Positional and movement data combined with event data provides the complete view of the game. Without both, it’s impossible to analyze and predict player decisions and capabilities in specific situations.

Some key things to note about “events” are:

Many events are actually multimodal in nature – both visual and audio (e.g., the referee’s whistle) – as they are dependent on human referee decisions. It is only a foul, penalty, off-side, yellow-card, red-card, corner or goal if the human referee decides it is that event. Even a goalkeeper touching a shot over the bar can only be a save if the referee awards a corner.
Some events have durations. A pass, for example, has a starting location and, if it is successfully received, an ending location.
Some events can change after the fact due to intervention by VAR or an assistant referee.
Many events occur with multiple players in close proximity and require close assessment to accurately and consistently detect and classify according to prescribed definitions.
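
The properties listed above suggest what a single event record has to carry. The following is a hypothetical schema, not Opta's actual data format; every field name and value is an assumption made for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

Point = Tuple[float, float]  # (x, y) in metres on the pitch

@dataclass(frozen=True)
class Event:
    event_type: str                    # e.g. "pass", "shot", "foul"
    player_id: str
    start_time: float                  # seconds from kick-off
    start_xy: Point
    end_time: Optional[float] = None   # set only for events with a duration
    end_xy: Optional[Point] = None     # e.g. where a completed pass was received
    revision: int = 0                  # incremented when the event is corrected

def revise(event: Event, **changes) -> Event:
    """Return a corrected copy, e.g. after a VAR review changes the decision."""
    return replace(event, revision=event.revision + 1, **changes)

# A goal is recorded live, then overturned on review (made-up values).
goal = Event("goal", "player_9", 1834.2, (99.0, 36.5))
overturned = revise(goal, event_type="offside")
```

Keeping records immutable and issuing a new revision, rather than editing in place, is one simple way to preserve what was recorded live alongside the final decision.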

When you consider that teams and media need event data to be collected live, consistently and accurately for it to be useful, across hundreds of elite men’s and women’s soccer competitions around the globe, the necessity of having expert humans in the loop becomes clear, both for situations where different camera views are encountered and for interpreting referee decisions (or changes in decisions). Even when there are 10-12 cameras and a chip in the ball, human intervention is needed, as exemplified by the semi-automated offside technology used at the 2022 FIFA men’s World Cup.

So the input sources of sports data could be thought of as multimodal, incorporating inputs from human collection as well as via computer vision. The complementary nature of the input data, as well as the redundancy baked into this process ensures that complete and accurate data is captured, regardless of what happens during the game, the input video or the decision-making of the referee.

Can’t GPT-4o or Gemini do image/video processing for sport as they are Multimodal? Why can’t you use that to create player tracking data?

Aside from the high cost and latency of using commercial APIs to process images and video data, using off-the-shelf models will only capture a portion of players who are clearly visible – resulting in a lack of critical “last mile” detail, including major gaps in play, due to the various nuances of sport and its many edge-cases.

The reasons why this is the case are:

Training Data: Models like GPT-4o and Gemini are trained on publicly available data which are based on image and caption pairing, and not domain-specific detailed sequences of sports data containing associated tracking and event data, and
Language: Models like GPT-4o and Gemini are learning the correlations between images/video and text. As mentioned previously, we want to learn the correlations between tracking data and event data, which is our images/video and text equivalent.

Another way to think of this is that sports data (tracking and event) is its own “language”, and GPT-4o and Gemini have been optimized for natural language (image and caption) – so Stats Perform’s foundation models are literally speaking a different language to models not trained on detailed sports data.

While it could be theoretically possible to learn a model directly from image/video and event data pairings, it isn’t practical, for three reasons: tracking data compresses the video enormously (i.e., 1,000,000:1); tracking data grounds the output in the reality of the sport; and tracking data is a very useful output by itself for visualization, interaction and interpretability (as we will show in the next article).

Is getting an AI agent to watch a live sports game and explain the rules the same as analysing a game?

This is a good question, and really gets to the heart of the difference in understanding a language (or understanding a topic like a novice or expert). Current multimodal LLMs based on natural language could recognize a video and identify it as a game of soccer (and maybe identify some of the teams and players – and potentially the score and time of the match from the score ‘bug’ on the screen). From that, it could explain the rules of soccer and maybe some history of the clubs involved, being something it could quickly glean from a search on Wikipedia (i.e., high-level text information which can be publicly found on the internet).

However, identifying what sport is being played and detecting details about what’s happening in the game are two very different things. The next wave of GenAI isn’t merely about identifying what sport is being played, which is what a novice could do, but about watching the game like an “expert”. To do this, you need to have the language of an expert. For soccer, that means understanding which formation a team is playing, where a defender “should have been” in a given situation, what pass a player “should have made” and how costly a misplaced pass that led to a counter-attack was. It is also vital to connect it to the “live” element – something current off-the-shelf LLMs can’t do because they have a knowledge cut-off. So having both the event and tracking data, and also having the sports database “live and up-to-date”, is extremely important and is absolutely required to “watch” a game like an expert.

In the next article, we will discuss how we can use event and tracking data as the raw language of sport and then transform it in a way where we can “watch” a game like an expert. In essence, the event and tracking data serve as the words (both textual and visual) – but they are still unstructured, as we need to form sentences, paragraphs and chapters, up to an entire book (or library of books).

Is sports data structured or unstructured?

In terms of distinct events (like a pass or a shot), the data is structured. It can be stored and retrieved in a database. We can also store the tracking data as a row per frame of action.

The challenge is that a sport like soccer is a continuous game, and to model the complete picture of 22 players moving and events occurring, we need to piece these together sequentially, not independently. The analogy here would be storing each word or sentence from a book separately: everything is stored, but the context is lost.

The tracking and event data associated with each event can be thought of as a sentence within a book (where a game is a book). Another way to think of the tracking and event data that we have collected is as atoms that we need to bring together into a coherent structure.
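
One way to picture the “sentence” analogy is to attach to each event the run of tracking frames around it, so that the frames stay in order instead of being treated as isolated rows. The sketch below assumes 10 frames per second and a one-second window either side of the event; both numbers, and all names, are illustrative assumptions.

```python
from bisect import bisect_left, bisect_right

def frames_around(frame_times, event_time, window=1.0):
    """Return the indices of frames whose timestamp t satisfies
    event_time - window <= t <= event_time + window.

    frame_times must be sorted in ascending order.
    """
    lo = bisect_left(frame_times, event_time - window)
    hi = bisect_right(frame_times, event_time + window)
    return list(range(lo, hi))

frame_times = [i / 10 for i in range(100)]   # 10 seconds of frames at 10 fps
pass_time = 4.25                             # made-up timestamp of a pass
sentence = frames_around(frame_times, pass_time)
```

Here `sentence` holds the ordered frame indices that belong with the pass; a full game would then be the ordered list of such windows.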

However, the number of possible permutations of these atoms (i.e., events and players) is greater than the number of atoms in the universe!

Generative AI models enable us to learn the right structure from these raw unstructured atoms.

Basically everything we do in AI is about representation or getting the right input structure for a computer to learn from.

To generate tracking data and events together, we need to take into consideration the positions, velocities and accelerations of all players, as well as the events before. These are all time-varying. As above, this has more permutations than there are atoms in the universe – so our models, trained on the raw data, enable us to learn the right structure (otherwise known as the embedding).
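
As a small illustration of those inputs, velocities and accelerations can be derived from position data by finite differences. This is a generic sketch, not the feature pipeline used in Opta Vision; the array layout (frames, players, x/y) and the 10 fps frame rate are assumptions.

```python
import numpy as np

def kinematics(positions, fps):
    """Derive velocity (m/s) and acceleration (m/s^2) from positions.

    positions: array of shape (frames, players, 2), in metres.
    Both outputs have the same shape as the input.
    """
    positions = np.asarray(positions, dtype=float)
    dt = 1.0 / fps
    # Central differences inside the sequence, one-sided at the two ends.
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    return velocity, acceleration

# One player running along the x-axis at a constant 5 m/s, sampled at 10 fps.
positions = np.zeros((5, 1, 2))
positions[:, 0, 0] = np.arange(5) * 0.5
velocity, acceleration = kinematics(positions, fps=10)
```

Real tracking data is noisy, so positions would normally be smoothed before differencing; differentiating twice amplifies any jitter.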

In the next article, we will do a deep dive on how we can utilize tracking data in many different ways – specifically on how to watch a game like an expert, but also how to search visually and interactively.

You spoke briefly about RoboSoccer in the last article, is any of this related?

We started this article talking about the history of computer vision in sport, but didn’t touch on one of the first really active areas of computer vision in sport in the 1990s, which was RoboSoccer. It was one of the most active areas of research before the Moneyball revolution saw a focus on real-world sport.

The goal of RoboSoccer, or RoboCup, was to have a team of fully autonomous humanoid robots beat the best human soccer team in the world, on a real field by 2050. To get to this level, we need two things:

Create a robot that can start to move like a human, which is getting closer based on the recent release of the Boston Dynamics robot, and
Get these robots to “perceive” the world like a human player. But to do that, we need to generate enough examples for these robots to learn about the movement and structure of soccer.

I believe the work we have been doing within Opta Vision will help us analyze every game that has ever been played “completely”, and it will also start to provide the amount of complete data required to train a Robot to read the game like a human expert.

However, the beauty of sport is that it is played by humans – it is unpredictable, fluid, and it provides a live, unique and shared experience for people to enjoy. Although it is an interesting goal to pursue (a lot like teaching a computer to play Chess, Jeopardy! or Go – but much harder) – I think if anything, challenges like RoboCup will show how amazing humans are, and the level of preparation, practice and coaching required to both cognitively and physically perform at the highest level.

In future articles, we will highlight how we can use computer vision tracking data to understand sports like soccer, basketball and tennis. We will also highlight the role Generative AI plays in prediction.