Welcome to SoccerData’s docs!

Release v1.7.0. (pip install soccerdata)

SoccerData is a collection of scrapers to gather soccer data from popular websites, including Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.

import soccerdata as sd

# Create a scraper class instance for the 2018/19 Premier League
five38 = sd.FiveThirtyEight('ENG-Premier League', '1819')

# Fetch data
games = five38.read_games()
forecasts = five38.read_forecasts()
clinches = five38.read_clinches()

Main features

  • Access current and historical soccer fixtures, forecasts, detailed match stats, event stream data and more.

  • All data is provided in the form of Pandas DataFrames with sensible, matching column names and identifiers across datasets to make working with the data and combining data from multiple sources easy.

  • Data is only downloaded when needed and cached locally to speed up your analysis scripts.

  • Integrates with the socceraction package to allow analysis of event stream data.

Do you like it? Let’s dive in!

Getting Started

New to soccerdata? Well, you came to the right place: this tutorial will walk you through installing, configuring, and using the library. By the end of this tutorial, you will be able to scrape data from the top-5 European leagues and use it to create your own data-driven analyses.

Installation

SoccerData can be easily installed via pip:

python3 -m pip install soccerdata

Scraping data

Each of the supported data sources has its corresponding class for fetching data with a uniform API. For example, the FBref class is used to fetch data from fbref.com.

import soccerdata as sd

# Create scraper class instance
fbref = sd.FBref()

Once you have a scraper class instance, you can use it to fetch data. See the examples and API reference for the full list of options available for each scraper. For example, to fetch aggregated shooting stats for all teams:

# Create dataframes
season_stats = fbref.read_team_season_stats(stat_type='shooting')

The data is always returned as a convenient Pandas DataFrame.

league              season  team         #Pl  90s   Gls  Sh   SoT  SoT%  Sh/90  SoT/90  G/Sh  G/SoT  Dist  FK  PK  PKatt  xG    npxG  npxG/Sh  G-xG   np:G-xG
ENG-Premier League  2021    Arsenal      29   38.0  53   455  141  31.0  11.97  3.71    0.1   0.33   16.9  23  6   6      53.5  49.0  0.11     -0.5   -2.0
                            Aston Villa  24   38.0  52   518  179  34.6  13.63  4.71    0.09  0.26   16.5  15  5   6      52.9  48.5  0.1      -0.9   -1.5
                            Brighton     27   38.0  39   476  129  27.1  12.53  3.39    0.07  0.26   16.6  14  6   9      51.6  44.8  0.1      -12.6  -11.8
                            Burnley      25   38.0  32   383  125  32.6  10.08  3.29    0.08  0.23   16.6  15  3   3      39.9  37.6  0.1      -7.9   -8.6
                            Chelsea      27   38.0  56   553  194  35.1  14.55  5.11    0.09  0.25   16.3  16  8   10     64.0  56.4  0.1      -8.0   -8.4

By default, the data for all available leagues and the five most recent seasons will be retrieved. In most cases, however, you will want to limit the data to specific leagues and/or seasons. This can be done by passing a list of leagues and seasons to the constructor of the scraper class. For example:

# Create scraper class instance filtering on specific leagues and seasons
fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])
# Retrieve data for the specified leagues and seasons
season_stats = fbref.read_team_season_stats(stat_type='shooting')

Note that only a limited number of leagues are supported out-of-the-box. The leagues available for each source can be listed with the available_leagues() class method.

sd.FBref.available_leagues()
>>> ['ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga', 'ITA-Serie A']

You can add more leagues but there are no guarantees that they will be scraped correctly.

Data caching

Data caching is used to speed up the runtime and to prevent exceeding the rate limit of web servers. By default, all downloaded data is cached to ~/soccerdata on Linux and Mac OS, and to C:\Users\yourusername\soccerdata on Windows. A custom location can be set if desired. You can configure this using environment variables (see below) or on the level of an individual scraper by setting the data_dir parameter when creating the scraper class instance:

# Create scraper class instance with custom caching directory
fbref = sd.FBref(data_dir="/tmp/FBref")

This directory can be deleted at any time to reclaim disk space. However, you will then have to download the same data again if you need it, which slows down your scripts.
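Clearing the cache can also be scripted. The following is a minimal sketch, not part of SoccerData: clear_cache is a hypothetical helper, and the path assumes the default per-source layout mentioned in the scraper docstrings (~/soccerdata/data/FBref). Removing only a source's data subdirectory leaves any configuration files stored elsewhere under ~/soccerdata untouched.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir: Path) -> bool:
    """Delete cache_dir and everything below it; return True if it existed."""
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Assumed default location for FBref data; adjust this path if you set
# data_dir or SOCCERDATA_DIR to something else.
fbref_cache = Path.home() / "soccerdata" / "data" / "FBref"
# clear_cache(fbref_cache)  # uncomment to actually delete the cached files
```

The call is left commented out so that copying the snippet does not delete anything by accident.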

SoccerData has no knowledge of when the data on the server changes, so it is up to the user to decide when to refresh the cache. This can be done by deleting the cache directory or by setting the no_cache option to True when creating the scraper class instance:

# Create scraper class instance which always re-downloads the latest data
fbref = sd.FBref(no_cache=True)

Some methods will assume the cache is always out-of-date (for example, when scraping the fixtures of the current season). Typically, these methods have a force_cache option that can be set to True to force the cached data to be used. For example:

fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['2324'])
fbref.read_schedule(force_cache=True)

Caching can also be disabled entirely by setting the no_store option to True when creating the scraper class instance. However, it should almost always be left enabled.

# Create scraper class instance with caching disabled
fbref = sd.FBref(no_store=True)

Global configuration

Several settings can be configured globally using the following environment variables:

SOCCERDATA_DIR

The directory where the downloaded data is cached and where logs are stored. By default, all data is stored to ~/soccerdata on Linux / Mac OS and C:\Users\yourusername\soccerdata on Windows.

SOCCERDATA_NOCACHE

If set to “true”, no cached data is returned. Note that no-cache does not mean “don’t cache”. All downloaded data is still cached and overwrites existing caches. If the sense of “don’t cache” that you want is actually “don’t store”, then SOCCERDATA_NOSTORE is the option to use. By default, data is retrieved from the cache.

SOCCERDATA_NOSTORE

If set to “true”, no data is stored. By default, data is cached.

SOCCERDATA_MAXAGE

The maximum age of cached data in seconds. If the cached data is older than this, it will be re-downloaded. By default, this is set to infinity.

SOCCERDATA_LOGLEVEL

The level of logging to use. By default, this is set to “INFO”.

Example:

# bash
export SOCCERDATA_DIR="$HOME/soccerdata"
export SOCCERDATA_NOCACHE="False"
export SOCCERDATA_NOSTORE="False"
export SOCCERDATA_LOGLEVEL="INFO"
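The same settings can be applied from Python, for example at the top of a notebook. This is a sketch under one assumption that the text above does not confirm: that SoccerData reads these variables when the package is imported, so they should be set before import soccerdata. The directory and the one-day maximum age are illustrative values.

```python
import os
from pathlib import Path

# Illustrative values; the variable names are the ones documented above.
settings = {
    "SOCCERDATA_DIR": str(Path.home() / "soccerdata"),
    "SOCCERDATA_NOCACHE": "False",
    "SOCCERDATA_NOSTORE": "False",
    "SOCCERDATA_MAXAGE": str(24 * 60 * 60),  # one day, expressed in seconds
    "SOCCERDATA_LOGLEVEL": "INFO",
}
os.environ.update(settings)

# Assumption: the variables are read at import time, so import afterwards.
# import soccerdata as sd
```

Environment variables are always strings, which is why the maximum age is converted with str().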

Uniform team names

Each data source uses a different set of team names, which makes it difficult to combine data from multiple sources. To mitigate this, SoccerData allows translating the team names to uniform names. This is done by providing a SOCCERDATA_DIR/config/team_dict.json file. This file should contain a mapping between a generic name for each team and the team name used by each data source that you want to support. The example below will map “Tottenham Hotspur”, “Tottenham Hotspur FC” and “Spurs” to “Tottenham” in all scraped data.

{
  "Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]
}
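If you maintain many teams, it may be easier to generate this file from Python, since json.dumps always produces valid JSON. The sketch below is not part of SoccerData: write_team_dict and invert are hypothetical helpers, and base_dir stands for whatever directory SOCCERDATA_DIR points to. The invert function only illustrates the direction of the mapping (source-specific name to uniform name); it is not a claim about how the library performs the lookup internally.

```python
import json
from pathlib import Path


def write_team_dict(base_dir: Path, mapping: dict[str, list[str]]) -> Path:
    """Write mapping to <base_dir>/config/team_dict.json and return the path."""
    config_dir = base_dir / "config"
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "team_dict.json"
    path.write_text(
        json.dumps(mapping, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return path


def invert(mapping: dict[str, list[str]]) -> dict[str, str]:
    """Map every source-specific name to its uniform name."""
    return {
        alias: uniform
        for uniform, aliases in mapping.items()
        for alias in aliases
    }


mapping = {"Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]}
lookup = invert(mapping)
print(lookup["Spurs"])  # Tottenham

# Example call, assuming the default data directory:
# write_team_dict(Path.home() / "soccerdata", mapping)
```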

Additional setup for scraping WhoScored data

WhoScored implements strong protection against scraping using Incapsula. To circumvent this, this scraper uses Selenium with the ChromeDriver extension to emulate a real user. Before using this scraper, you will have to install Chrome. A Selenium driver matching your Chrome version will be downloaded automatically when you run the scraper.

Next steps

Look at you! You’re now basically an expert at SoccerData! ✨

From this point you can:

  • Look at the example notebooks for each data source.

  • Take a deep dive into the API.

  • Give us feedback or contribute, see Contributing.

Have fun! 🎉

Overview of Data Sources

Currently, the following data sources are supported.


ClubElo

URL | Example usage | API reference

from soccerdata import ClubElo

Teams’ relative strengths as Elo ratings, for most European leagues. Recalculated after every round, includes history.


ESPN

URL | Example usage | API reference

from soccerdata import ESPN

Historical results, statistics and lineups.


FBref

URL | Example usage | API reference

from soccerdata import FBref

Historical results, lineups, and detailed aggregated statistics for teams and individual players based on Stats Perform data.


FiveThirtyEight

URL | Example usage | API reference

from soccerdata import FiveThirtyEight

Teams’ relative strengths as SPI ratings, predictions and results for the top European and American leagues.


Football-Data.co.uk

URL | Example usage | API reference

from soccerdata import MatchHistory

Historical results, betting odds and match statistics. Level of detail depends on league.


Sofascore

URL | Example usage | API reference

from soccerdata import Sofascore

Results, schedules, lineups, and detailed statistics for teams and individual players.


SoFIFA

URL | Example usage | API reference

from soccerdata import SoFIFA

Detailed scores on all players’ abilities from EA Sports FC.


Understat

URL | Example usage | API reference

from soccerdata import Understat

Advanced statistics such as xG, xGBuildup and xGChain, and shot events with associated xG values for the top European leagues.


WhoScored

URL | Example usage | API reference

from soccerdata import WhoScored

Historical results, match preview data and detailed Opta event stream data for major leagues.

This page was generated from doc/datasources/ClubElo.ipynb.
[2]:
import soccerdata as sd

ClubElo

[3]:
elo = sd.ClubElo()
print(elo.__doc__)
Provides pd.DataFrames from CSV API at http://api.clubelo.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/ClubElo``.

    Since the source does not provide league names, this class will not filter
    by league. League names will be inserted from the other sources where
    available. Leagues that are only covered by clubelo.com will have NaN
    values.

    Parameters
    ----------
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

Elo scores for all teams at a specified date

[4]:
current_elo = elo.read_by_date()
current_elo.head()
[4]:
team         rank  country  level  elo          from        to          league
Liverpool    1.0   ENG      1      2047.083862  2022-04-20  2022-04-24  ENG-Premier League
Man City     2.0   ENG      1      2037.059937  2022-04-21  2022-04-23  ENG-Premier League
Bayern       3.0   GER      1      1984.775391  2022-04-18  2022-04-23  GER-Bundesliga
Real Madrid  4.0   ESP      1      1969.584351  2022-04-21  2022-04-26  ESP-La Liga
Chelsea      5.0   ENG      1      1921.101440  2022-04-21  2022-04-24  ENG-Premier League
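Because ClubElo does not filter by league (see the docstring above), any league selection has to happen on the returned DataFrame. The sketch below rebuilds the five rows shown above by hand, with a subset of the columns, so that it runs without network access; on real data you would apply the same filter to the result of elo.read_by_date().

```python
import pandas as pd

# The five rows shown above, re-typed by hand (subset of columns).
current_elo = pd.DataFrame(
    {
        "rank": [1.0, 2.0, 3.0, 4.0, 5.0],
        "country": ["ENG", "ENG", "GER", "ESP", "ENG"],
        "elo": [2047.083862, 2037.059937, 1984.775391, 1969.584351, 1921.101440],
        "league": [
            "ENG-Premier League",
            "ENG-Premier League",
            "GER-Bundesliga",
            "ESP-La Liga",
            "ENG-Premier League",
        ],
    },
    index=pd.Index(
        ["Liverpool", "Man City", "Bayern", "Real Madrid", "Chelsea"], name="team"
    ),
)

# Keep one league and sort by rating, strongest first.
is_epl = current_elo["league"] == "ENG-Premier League"
epl = current_elo[is_epl].sort_values("elo", ascending=False)
print(epl.index.tolist())  # ['Liverpool', 'Man City', 'Chelsea']
```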

Full Elo history for one club

[5]:
barca_elo = elo.read_team_history("Barcelona")
barca_elo.head()
[5]:
from        rank  team       country  level  elo          to
1939-10-22  NaN   Barcelona  ESP      1      1636.704590  1939-12-03
1939-12-04  NaN   Barcelona  ESP      1      1626.102173  1939-12-10
1939-12-11  NaN   Barcelona  ESP      1      1636.728271  1939-12-17
1939-12-18  NaN   Barcelona  ESP      1      1646.951660  1939-12-24
1939-12-25  NaN   Barcelona  ESP      1      1637.424316  1939-12-31
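Each row of the history covers one rating period, so the change between consecutive periods can be derived with pandas. The sketch below re-types the five rows shown above and assumes that the "from" index holds datetime values; check the dtype of the real index before relying on that.

```python
import pandas as pd

# Ratings from the five rows shown above, indexed by period start date.
barca_elo = pd.DataFrame(
    {"elo": [1636.704590, 1626.102173, 1636.728271, 1646.951660, 1637.424316]},
    index=pd.to_datetime(
        ["1939-10-22", "1939-12-04", "1939-12-11", "1939-12-18", "1939-12-25"]
    ).rename("from"),
)

# Difference with the previous period; the first row has no predecessor (NaN).
barca_elo["elo_change"] = barca_elo["elo"].diff()
print(barca_elo)
```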
This page was generated from doc/datasources/ESPN.ipynb.
[2]:
import soccerdata as sd

ESPN

[3]:
espn = sd.ESPN(leagues="ENG-Premier League", seasons=2021)
print(espn.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg)
Provides pd.DataFrames from JSON api available at http://site.api.espn.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/ESPN``.

    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of leagues to include.

    seasons : string, int or list, optional
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

Game schedule

[4]:
epl_schedule = espn.read_schedule()
epl_schedule.head()
[4]:
league              season  game                                         date                       home_team          away_team         game_id  league_id
ENG-Premier League  2021    2019-08-09 Liverpool-Norwich City            2019-08-09 19:00:00+00:00  Liverpool          Norwich City      541844   eng.1
                            2019-08-10 AFC Bournemouth-Sheffield United  2019-08-10 14:00:00+00:00  AFC Bournemouth    Sheffield United  541840   eng.1
                            2019-08-10 Burnley-Southampton               2019-08-10 14:00:00+00:00  Burnley            Southampton       541841   eng.1
                            2019-08-10 Crystal Palace-Everton            2019-08-10 14:00:00+00:00  Crystal Palace     Everton           541839   eng.1
                            2019-08-10 Tottenham Hotspur-Aston Villa     2019-08-10 16:30:00+00:00  Tottenham Hotspur  Aston Villa       541837   eng.1

Match sheet data

[5]:
matchsheet = espn.read_matchsheet(match_id=541465)
matchsheet.head()
[5]:
is_home venue attendance capacity roster fouls_committed yellow_cards red_cards offsides won_corners saves possession_pct total_shots shots_on_target
league season game team
ENG-Premier League 2021 2020-07-26 West Ham United-Aston Villa Aston Villa False None 0 None [{'active': True, 'starter': True, 'jersey': '... 16 2 0 0 0 3 63 10 1
West Ham United True None 0 None [{'active': True, 'starter': True, 'jersey': '... 13 1 0 1 7 0 37 13 4

Line ups

[6]:
lineups = espn.read_lineup(match_id=541465)
lineups.head()
[6]:
is_home position formation_place sub_in sub_out appearances fouls_committed fouls_suffered own_goals red_cards sub_ins yellow_cards goal_assists shots_on_target total_goals total_shots goals_conceded saves shots_faced offsides
league season game team player
ENG-Premier League 2021 2020-07-26 West Ham United-Aston Villa Aston Villa Anwar El Ghazi False Substitute 0 90 end 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0
Conor Hourihane False Center Left Midfielder 10 start 76 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0
Douglas Luiz False Defensive Midfielder 4 start end 1.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN 0.0 0.0
Ezri Konsa False Center Right Defender 5 start end 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 NaN 0.0 0.0
Frédéric Guilbert False Right Back 2 start 76 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 NaN 0.0 0.0
This page was generated from doc/datasources/FBref.ipynb.
[3]:
import soccerdata as sd

FBref

[4]:
fbref = sd.FBref(leagues="ENG-Premier League", seasons=2021)
print(fbref.__doc__)
Provides pd.DataFrames from data at http://fbref.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/FBref``.

    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of leagues to include. For efficiently reading data from the Top-5
        European leagues, use "Big 5 European Leagues Combined".
    seasons : string, int or list, optional
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:471: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg, stacklevel=1)

Team season stats

[5]:
team_season_stats = fbref.read_team_season_stats(stat_type="passing")
team_season_stats.head()
[5]:
players_used 90s Total Short Medium Long Ast xAG xA A-xAG KP 1/3 PPA CrsPA PrgP url
Cmp Att Cmp% TotDist PrgDist Cmp Att Cmp% Cmp Att Cmp% Cmp Att Cmp%
league season team
ENG-Premier League 2021 Arsenal 29 38 18041 21807 82.7 306094 93540 8319 9236 90.1 7770 8814 88.2 1453 2621 55.4 38 35.1 35.3 2.9 332 1377 325 67 1760 /en/squads/18bb7c10/2020-2021/Arsenal-Stats
Aston Villa 24 38 12938 16996 76.1 235927 93009 5729 6654 86.1 5222 6209 84.1 1571 3074 51.1 38 37.4 31.4 0.6 403 1086 323 76 1420 /en/squads/8602292d/2020-2021/Aston-Villa-Stats
Brighton 27 38 15772 19871 79.4 283213 95248 6694 7654 87.5 6893 7960 86.6 1689 3090 54.7 24 33.2 34.9 -9.2 358 1132 358 80 1516 /en/squads/d07537b9/2020-2021/Brighton-and-Hov...
Burnley 25 38 10825 15562 69.6 196816 82192 5117 6058 84.5 3688 4824 76.5 1553 3535 43.9 20 27.7 26.9 -7.7 267 923 292 80 1012 /en/squads/943e8050/2020-2021/Burnley-Stats
Chelsea 27 38 21969 25799 85.2 360450 112392 10454 11454 91.3 8965 9976 89.9 1667 2703 61.7 38 41.5 37.1 -3.5 448 1441 377 70 1798 /en/squads/cff3d9bb/2020-2021/Chelsea-Stats

Team match stats

[6]:
team_match_stats = fbref.read_team_match_stats(stat_type="schedule", team="Manchester City")
team_match_stats.head()
[6]:
date time round day venue result GF GA opponent xG xGA Poss Attendance Captain Formation Referee match_report Notes
league season team game
ENG-Premier League 2021 Manchester City 2020-09-21 Wolves-Manchester City 2020-09-21 20:15:00 Matchweek 2 Mon Away W 3 1 Wolves 1.9 0.6 65 <NA> Fernandinho 4-2-3-1 Andre Marriner /en/matches/1c17eca3/Wolverhampton-Wanderers-M... <NA>
2020-09-27 Manchester City-Leicester City 2020-09-27 16:30:00 Matchweek 3 Sun Home L 2 5 Leicester City 0.9 2.9 72 <NA> Fernandinho 4-2-3-1 Michael Oliver /en/matches/31c2a061/Manchester-City-Leicester... <NA>
2020-10-03 Leeds United-Manchester City 2020-10-03 17:30:00 Matchweek 4 Sat Away D 1 1 Leeds United 1.2 2.4 49 <NA> Kevin De Bruyne 4-3-3 Mike Dean /en/matches/5ce15b58/Leeds-United-Manchester-C... <NA>
2020-10-17 Manchester City-Arsenal 2020-10-17 17:30:00 Matchweek 5 Sat Home W 1 0 Arsenal 1.3 0.9 58 <NA> Raheem Sterling 3-1-4-2 Chris Kavanagh /en/matches/e95b8546/Manchester-City-Arsenal-O... <NA>
2020-10-24 West Ham-Manchester City 2020-10-24 12:30:00 Matchweek 6 Sat Away D 1 1 West Ham 1.0 0.3 69 <NA> Raheem Sterling 4-3-3 Anthony Taylor /en/matches/2b0c0eca/West-Ham-United-Mancheste... <NA>

Player season stats

[7]:
player_season_stats = fbref.read_player_season_stats(stat_type="standard")
player_season_stats.head()
[7]:
nation pos age born Playing Time Performance Expected Progression Per 90 Minutes
MP Starts Min 90s Gls Ast G+A G-PK PK PKatt CrdY CrdR xG npxG xAG npxG+xAG PrgC PrgP PrgR Gls Ast G+A G-PK G+A-PK xG xAG xG+xAG npxG npxG+xAG
league season team player
ENG-Premier League 2021 Arsenal Ainsley Maitland-Niles ENG MF,DF 22 1997 11 5 490 5.4 0 0 0 0 0 0 0 0 0.1 0.1 0.1 0.2 12 24 21 0.0 0.0 0.0 0.0 0.0 0.01 0.02 0.04 0.01 0.04
Alexandre Lacazette FRA FW 29 1991 31 22 1923 21.4 13 2 15 10 3 3 3 0 11.7 9.3 2.2 11.5 15 55 112 0.61 0.09 0.7 0.47 0.56 0.55 0.1 0.65 0.43 0.54
Bernd Leno GER GK 28 1992 35 35 3131 34.8 0 0 0 0 0 0 0 1 0.0 0.0 0.0 0.0 0 2 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bukayo Saka ENG FW,MF 18 2001 32 30 2553 28.4 5 3 8 5 0 0 1 0 6.9 6.9 4.4 11.3 85 101 269 0.18 0.11 0.28 0.18 0.28 0.24 0.16 0.4 0.24 0.4
Calum Chambers ENG DF 25 1995 10 8 753 8.4 0 2 2 0 0 0 0 0 0.4 0.4 1.1 1.5 16 40 49 0.0 0.24 0.24 0.0 0.24 0.05 0.13 0.18 0.05 0.18

Player match stats

[8]:
player_match_stats = fbref.read_player_match_stats(stat_type="passing", match_id='db261cb0')
player_match_stats.head()
[8]:
jersey_number nation pos age min Total Short Medium Long Ast xAG xA KP 1/3 PPA CrsPA PrgP game_id
Cmp Att Cmp% TotDist PrgDist Cmp Att Cmp% Cmp Att Cmp% Cmp Att Cmp%
league season game team player
ENG-Premier League 2021 2020-09-12 Crystal Palace-Southampton Crystal Palace Andros Townsend 10 ENG RM 29-058 90 13 30 43.3 184 68 8 11 72.7 4 8 50.0 1 6 16.7 1 0.6 0.2 2 0 1 0 1 db261cb0
Cheikhou Kouyaté 8 SEN CB 30-266 90 11 19 57.9 207 109 3 4 75.0 3 6 50.0 3 6 50.0 0 0.0 0.0 0 1 0 0 2 db261cb0
Eberechi Eze 25 ENG LM 22-075 10 8 10 80.0 177 46 2 2 100.0 3 3 100.0 2 3 66.7 0 0.0 0.0 0 1 0 0 2 db261cb0
James McArthur 18 SCO CM 32-341 90 15 32 46.9 322 159 2 6 33.3 11 15 73.3 1 7 14.3 0 0.4 0.0 1 2 0 0 3 db261cb0
James McCarthy 22 IRL CM 29-305 73 14 23 60.9 281 75 4 6 66.7 6 10 60.0 3 5 60.0 0 0.0 0.1 0 0 0 0 2 db261cb0

Game schedule

[9]:
epl_schedule = fbref.read_schedule()
epl_schedule.head()
[9]:
week day date time home_team home_xg score away_xg away_team attendance venue referee match_report notes game_id
league season game
ENG-Premier League 2021 2020-09-12 Crystal Palace-Southampton 1 Sat 2020-09-12 15:00 Crystal Palace 1.1 1–0 0.9 Southampton <NA> Selhurst Park Jonathan Moss /en/matches/db261cb0/Crystal-Palace-Southampto... <NA> db261cb0
2020-09-12 Fulham-Arsenal 1 Sat 2020-09-12 12:30 Fulham 0.1 0–3 1.9 Arsenal <NA> Craven Cottage Chris Kavanagh /en/matches/bf52349b/Fulham-Arsenal-September-... <NA> bf52349b
2020-09-12 Liverpool-Leeds United 1 Sat 2020-09-12 17:30 Liverpool 2.7 4–3 0.3 Leeds United <NA> Anfield Michael Oliver /en/matches/21b58926/Liverpool-Leeds-United-Se... <NA> 21b58926
2020-09-12 West Ham-Newcastle Utd 1 Sat 2020-09-12 20:00 West Ham 1.0 0–2 1.6 Newcastle Utd <NA> London Stadium Stuart Attwell /en/matches/78495ced/West-Ham-United-Newcastle... <NA> 78495ced
2020-09-13 Tottenham-Everton 1 Sun 2020-09-13 16:30 Tottenham 1.1 0–1 1.2 Everton <NA> Tottenham Hotspur Stadium Martin Atkinson /en/matches/fc7f9aa1/Tottenham-Hotspur-Everton... <NA> fc7f9aa1

Line ups

[10]:
lineups = fbref.read_lineup(match_id='db261cb0')
lineups.head()
[10]:
league              season  game                                   jersey_number  player            team            is_starter  position  minutes_played
ENG-Premier League  2021    2020-09-12 Crystal Palace-Southampton  31             Vicente Guaita    Crystal Palace  True        GK        90
                            2020-09-12 Crystal Palace-Southampton  2              Joel Ward         Crystal Palace  True        RB        90
                            2020-09-12 Crystal Palace-Southampton  6              Scott Dann        Crystal Palace  True        CB        90
                            2020-09-12 Crystal Palace-Southampton  8              Cheikhou Kouyaté  Crystal Palace  True        CB        90
                            2020-09-12 Crystal Palace-Southampton  9              Jordan Ayew       Crystal Palace  True        FW        90

Events

[11]:
events = fbref.read_events(match_id='db261cb0')
events.head()
[11]:
league              season  game                                   team            minute  score  player1             player2          event_type
ENG-Premier League  2021    2020-09-12 Crystal Palace-Southampton  Crystal Palace  13      1:0    Wilfried Zaha       Andros Townsend  goal
                            2020-09-12 Crystal Palace-Southampton  Crystal Palace  41      1:0    Jeffrey Schlupp     None             yellow_card
                            2020-09-12 Crystal Palace-Southampton  Southampton     46      1:0    Jannik Vestergaard  Jan Bednarek     substitute_in
                            2020-09-12 Crystal Palace-Southampton  Southampton     51      1:0    Kyle Walker-Peters  None             yellow_card
                            2020-09-12 Crystal Palace-Southampton  Crystal Palace  58      1:0    James McCarthy      None             yellow_card

Shot events

[12]:
shots = fbref.read_shot_events(match_id='db261cb0')
shots.head()
[12]:
minute player team xG PSxG outcome distance body_part notes SCA 1 SCA 2
player event player event
league season game
ENG-Premier League 2021 2020-09-12 Crystal Palace-Southampton 4 Jack Stephens Southampton 0.07 0.09 Saved 10 Head <NA> James Ward-Prowse Pass (Dead) Kyle Walker-Peters Pass (Dead)
2020-09-12 Crystal Palace-Southampton 13 Wilfried Zaha Crystal Palace 0.44 0.73 Goal 9 Right Foot Volley Andros Townsend Pass (Live) James McCarthy Interception
2020-09-12 Crystal Palace-Southampton 32 Danny Ings Southampton 0.04 0.12 Saved 10 Head <NA> Kyle Walker-Peters Pass (Live) William Smallbone Pass (Live)
2020-09-12 Crystal Palace-Southampton 36 Cheikhou Kouyaté Crystal Palace 0.15 0.09 Saved 11 Right Foot <NA> Andros Townsend Pass (Dead) Wilfried Zaha Fouled
2020-09-12 Crystal Palace-Southampton 46 Nathan Redmond Southampton 0.07 0.03 Saved 14 Right Foot <NA> Che Adams Pass (Live) Danny Ings Pass (Live)
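Shot events lend themselves to simple aggregations such as shot counts and total xG per team. The sketch below re-types the five shots shown above into a flat DataFrame with a subset of the columns. The real result of read_shot_events() has a two-level column header (for the SCA 1 and SCA 2 columns), so column selection on the actual DataFrame may need to be adapted.

```python
import pandas as pd

# The five shots shown above, re-typed by hand (subset of columns).
shots = pd.DataFrame(
    {
        "minute": [4, 13, 32, 36, 46],
        "player": [
            "Jack Stephens",
            "Wilfried Zaha",
            "Danny Ings",
            "Cheikhou Kouyaté",
            "Nathan Redmond",
        ],
        "team": [
            "Southampton",
            "Crystal Palace",
            "Southampton",
            "Crystal Palace",
            "Southampton",
        ],
        "xG": [0.07, 0.44, 0.04, 0.15, 0.07],
        "outcome": ["Saved", "Goal", "Saved", "Saved", "Saved"],
    }
)

# Number of shots and summed xG per team.
xg_by_team = shots.groupby("team")["xG"].agg(["count", "sum"])
print(xg_by_team)
```

For these five rows, Crystal Palace have two shots worth about 0.59 xG and Southampton three shots worth about 0.18 xG.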
This page was generated from doc/datasources/FiveThirtyEight.ipynb.
[3]:
import soccerdata as sd

FiveThirtyEight

[4]:
five38 = sd.FiveThirtyEight(leagues="ENG-Premier League", seasons=2021)
print(five38.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg)
Provides pd.DataFrames from fivethirtyeight's "Club Soccer Predictions" project.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/FiveThirtyEight``.

    Original project and background info:
    https://projects.fivethirtyeight.com/soccer-predictions/ and
    https://fivethirtyeight.com/features/how-our-club-soccer-projections-work/


    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of Leagues to include.
    seasons : string, int or list, optional
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

Game schedule and predicted results

[5]:
epl_schedule = five38.read_games()
epl_schedule.head()
[5]:
date status leg home_team away_team home_id away_id home_code away_code prob_home prob_away prob_tie round matchday score_home score_away adj_score_home adj_score_away chances_home chances_away moves_home moves_away aggregate_winner shootout_winner
league season game
ENG-Premier League 2021 2020-09-12 Crystal Palace-Southampton 2020-09-12 14:00:00+00:00 post None Crystal Palace Southampton 384 376 CRY SOU 0.36979 0.35311 0.27710 None None 1 0 1.05 0.000 1.507 1.160 0.943 1.652 None None
2020-09-12 Fulham-Arsenal 2020-09-12 11:30:00+00:00 post None Fulham Arsenal 370 359 FUL ARS 0.28167 0.46081 0.25752 None None 0 3 0.00 3.150 0.141 2.989 0.270 1.528 None None
2020-09-12 Liverpool-Leeds United 2020-09-12 16:30:00+00:00 post None Liverpool Leeds United 364 357 LIV LEE 0.83582 0.04384 0.12034 None None 4 3 4.20 3.150 3.328 0.368 3.006 0.747 None None
2020-09-12 West Ham United-Newcastle 2020-09-12 19:00:00+00:00 post None West Ham United Newcastle 371 361 WHU NEW 0.48865 0.25556 0.25579 None None 0 2 0.00 1.666 0.973 1.475 1.103 1.860 None None
2020-09-13 Tottenham Hotspur-Everton 2020-09-13 15:30:00+00:00 post None Tottenham Hotspur Everton 367 368 TOT EVE 0.50235 0.24434 0.25331 None None 0 1 0.00 1.050 1.029 1.109 1.316 1.110 None None
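Since every source uses the same (league, season, game) index, results from different scrapers can be joined directly. The sketch below re-types a few values from the FBref schedule and the FiveThirtyEight games shown in these pages; storing the season as the string "2021" is an assumption about the index dtype. Three games carry identical keys in both sources, while the West Ham fixture is named differently ("West Ham-Newcastle Utd" versus "West Ham United-Newcastle") and therefore drops out of an inner join. This is the situation the team_dict.json mapping is meant to address.

```python
import pandas as pd

names = ["league", "season", "game"]
league, season = "ENG-Premier League", "2021"

# Expected goals per game, as listed in the FBref schedule.
fbref_xg = pd.DataFrame(
    {"home_xg": [1.1, 0.1, 2.7, 1.0], "away_xg": [0.9, 1.9, 0.3, 1.6]},
    index=pd.MultiIndex.from_tuples(
        [
            (league, season, "2020-09-12 Crystal Palace-Southampton"),
            (league, season, "2020-09-12 Fulham-Arsenal"),
            (league, season, "2020-09-12 Liverpool-Leeds United"),
            (league, season, "2020-09-12 West Ham-Newcastle Utd"),
        ],
        names=names,
    ),
)

# Pre-match probabilities, as listed in the FiveThirtyEight games.
five38_probs = pd.DataFrame(
    {
        "prob_home": [0.36979, 0.28167, 0.83582, 0.48865],
        "prob_away": [0.35311, 0.46081, 0.04384, 0.25556],
        "prob_tie": [0.27710, 0.25752, 0.12034, 0.25579],
    },
    index=pd.MultiIndex.from_tuples(
        [
            (league, season, "2020-09-12 Crystal Palace-Southampton"),
            (league, season, "2020-09-12 Fulham-Arsenal"),
            (league, season, "2020-09-12 Liverpool-Leeds United"),
            (league, season, "2020-09-12 West Ham United-Newcastle"),
        ],
        names=names,
    ),
)

# Inner join on the shared index: only games with identical keys survive.
combined = fbref_xg.join(five38_probs, how="inner")
print(len(combined))  # 3
```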

Forecasted league table

[6]:
forecasts = five38.read_forecasts()
forecasts.head()
[6]:
code current_losses current_points current_ties current_wins d_rating global_d global_o global_rating goal_diff goals_against goals_scored id losses make_champs make_europa make_playoffs o_rating points position_1 position_10 position_11 position_12 position_13 position_14 position_15 position_16 position_17 position_18 position_19 position_2 position_20 position_3 position_4 position_5 position_6 position_7 position_8 position_9 promoted relegated sdr_id ties win_league wins
league season last_updated team
ENG-Premier League 2021 2020-09-09T14:46:49.136Z Arsenal ARS 0.0 0.0 0.0 0.0 1.33114 0.68339 2.13421 77.16360 4.13260 49.72715 53.85975 359 13.27675 0.16095 0.0 0.0 1.48647 55.50395 0.00720 0.06905 0.06170 0.05220 0.04835 0.04075 0.03595 0.03045 0.02650 0.02035 0.01590 0.02725 0.01165 0.05280 0.07370 0.08895 0.09120 0.08990 0.08340 0.07275 0.0 0.04790 73616 9.3329 0.00720 15.39035
Aston Villa AVL 0.0 0.0 0.0 0.0 1.52263 0.87488 1.77744 66.48339 -17.57325 57.65595 40.08270 362 17.72125 0.02335 0.0 0.0 1.12969 41.66985 0.00030 0.05405 0.06005 0.06500 0.06855 0.07090 0.07480 0.07685 0.07955 0.08465 0.08315 0.00315 0.08465 0.00615 0.01375 0.02105 0.02830 0.03375 0.04145 0.04990 0.0 0.25245 73597 9.5832 0.00030 10.69555
Brighton and Hove Albion BHA 0.0 0.0 0.0 0.0 1.45713 0.80938 1.77377 68.02657 -15.24810 55.36300 40.11490 331 17.18525 0.02950 0.0 0.0 1.12603 42.92225 0.00075 0.05475 0.06255 0.06575 0.07025 0.07305 0.07485 0.07690 0.07505 0.07790 0.07590 0.00295 0.06700 0.00965 0.01615 0.02415 0.03275 0.04185 0.04450 0.05330 0.0 0.22080 73651 9.7610 0.00075 11.05375
Burnley BRN 0.0 0.0 0.0 0.0 1.41912 0.77137 1.77961 69.08173 -13.63310 53.84085 40.20775 379 16.80205 0.03210 0.0 0.0 1.13187 43.86545 0.00100 0.05910 0.06525 0.06555 0.07245 0.06670 0.07370 0.07325 0.07340 0.07010 0.06735 0.00325 0.06145 0.01065 0.01720 0.02915 0.03935 0.04365 0.05110 0.05635 0.0 0.19890 73618 9.8642 0.00100 11.33375
Chelsea CHE 0.0 0.0 0.0 0.0 1.16934 0.52160 2.51615 85.37204 26.57990 42.20305 68.78295 363 9.34340 0.53090 0.0 0.0 1.86841 69.17900 0.06525 0.02775 0.02295 0.01715 0.01315 0.01000 0.00735 0.00585 0.00400 0.00230 0.00215 0.13860 0.00050 0.17560 0.15145 0.11230 0.08610 0.06630 0.04965 0.04160 0.0 0.00495 73598 8.3954 0.06525 20.26120
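
The position_1 … position_20 columns look like the simulated probability of finishing in each league position. The sketch below is a quick sanity check under that assumption. It copies Arsenal's values from the output above into a plain dict (instead of reading them from the DataFrame) and compares a few sums with the make_champs and relegated columns. The interpretation of the columns is inferred from these numbers only.

```python
# Arsenal's position_1 ... position_20 values, copied by hand from the output above.
arsenal_position_probs = {
    1: 0.00720, 2: 0.02725, 3: 0.05280, 4: 0.07370, 5: 0.08895,
    6: 0.09120, 7: 0.08990, 8: 0.08340, 9: 0.07275, 10: 0.06905,
    11: 0.06170, 12: 0.05220, 13: 0.04835, 14: 0.04075, 15: 0.03595,
    16: 0.03045, 17: 0.02650, 18: 0.02035, 19: 0.01590, 20: 0.01165,
}

# Total probability mass over all finishing positions.
total = sum(arsenal_position_probs.values())
# Mass on positions 1-4 and on positions 18-20.
top_four = sum(p for pos, p in arsenal_position_probs.items() if pos <= 4)
bottom_three = sum(p for pos, p in arsenal_position_probs.items() if pos >= 18)
# Probability-weighted average finishing position.
expected_position = sum(pos * p for pos, p in arsenal_position_probs.items())

print(f"total probability: {total:.5f}")
print(f"top four:          {top_four:.5f}  (make_champs column: 0.16095)")
print(f"bottom three:      {bottom_three:.5f}  (relegated column:   0.04790)")
print(f"expected position: {expected_position:.2f}")
```

For this row, the top-four mass matches make_champs and the bottom-three mass matches relegated, which supports the reading above but does not prove it for other leagues.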

Clinches

[7]:
clinches = five38.read_clinches()
clinches.head()
[7]:
                                      typ              team
league             season date
ENG-Premier League 2021   2021-02-13  relegated_elim   Manchester City
                          2021-02-17  win_league_elim  Sheffield United
                          2021-02-17  win_league_elim  West Bromwich Albion
                          2021-02-27  win_league_elim  Fulham
                          2021-03-02  win_league_elim  Newcastle
This page was generated from doc/datasources/FotMob.ipynb.
[1]:
import pandas as pd
pd.set_option('display.max_columns', None)
[3]:
import soccerdata as sd

FotMob

[4]:
fotmob = sd.FotMob(leagues='ESP-La Liga', seasons='2022/2023')
print(fotmob.__doc__)
Provides pd.DataFrames from data available at http://www.fotmob.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/FotMob``.

    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of Leagues to include.
    seasons : string, int or list, optional
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - 'tor': Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.
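
The docstring above says that proxy may be a callable that returns a valid proxy and is called again after failed requests. A minimal sketch of such a callable follows. The proxy addresses and the names PROXY_POOL and next_proxy are made up for illustration, and the commented-out constructor call is an assumed usage derived from the docstring, not something executed here.

```python
from itertools import cycle

# Hypothetical proxy addresses; replace them with proxies you control.
PROXY_POOL = cycle([
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
])


def next_proxy():
    """Return the next proxy mapping from the pool, wrapping around at the end."""
    return next(PROXY_POOL)


# Assumed usage, based on the docstring above (not executed here):
# fotmob = sd.FotMob(leagues='ESP-La Liga', seasons='2022/2023', proxy=next_proxy)
```

Each mapping has the same shape as the dict example in the docstring, so the same pool could also be passed as a plain list if rotation order does not matter.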

Read league table

[5]:
league_table = fotmob.read_league_table()
league_table.head()
[5]:
                    team             MP  W   D  L   GF  GA  GD  Pts
league      season
ESP-La Liga 2223    Barcelona        38  28  4  6   70  20  50  88
            2223    Real Madrid      38  24  6  8   75  36  39  78
            2223    Atletico Madrid  38  23  8  7   70  33  37  77
            2223    Real Sociedad    38  21  8  9   51  35  16  71
            2223    Villarreal       38  19  7  12  59  40  19  64
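
A scraped league table is easy to sanity-check, because its columns are related: matches played should equal wins plus draws plus losses, goal difference should equal goals for minus goals against, and points should equal three per win plus one per draw. The sketch below applies these checks to the five rows shown above, copied by hand into tuples. The helper name inconsistencies is hypothetical, and the points rule assumes no point deductions.

```python
# Rows copied from the league table above: (team, MP, W, D, L, GF, GA, GD, Pts).
rows = [
    ("Barcelona", 38, 28, 4, 6, 70, 20, 50, 88),
    ("Real Madrid", 38, 24, 6, 8, 75, 36, 39, 78),
    ("Atletico Madrid", 38, 23, 8, 7, 70, 33, 37, 77),
    ("Real Sociedad", 38, 21, 8, 9, 51, 35, 16, 71),
    ("Villarreal", 38, 19, 7, 12, 59, 40, 19, 64),
]


def inconsistencies(row):
    """Return the names of the columns that disagree with the other columns."""
    team, mp, w, d, l, gf, ga, gd, pts = row
    problems = []
    if w + d + l != mp:
        problems.append("MP")
    if gf - ga != gd:
        problems.append("GD")
    if 3 * w + d != pts:  # assumes 3 points per win, 1 per draw, no deductions
        problems.append("Pts")
    return problems


report = {row[0]: inconsistencies(row) for row in rows}
for team, problems in report.items():
    print(team, "OK" if not problems else problems)
```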

Read schedule

[6]:
schedule = fotmob.read_schedule()
schedule.head()
[6]:
round week date home_team away_team home_score away_score status game_id url
league season game
ESP-La Liga 2223 2022-08-12 Osasuna-Sevilla None None 2022-08-12 19:00:00+00:00 Osasuna Sevilla 2 1 FT 3918013 https://fotmob.com/matches/sevilla-vs-osasuna/...
2022-08-13 Barcelona-Rayo Vallecano None None 2022-08-13 19:00:00+00:00 Barcelona Rayo Vallecano 0 0 FT 3918009 https://fotmob.com/matches/rayo-vallecano-vs-b...
2022-08-13 Celta Vigo-Espanyol None None 2022-08-13 15:00:00+00:00 Celta Vigo Espanyol 2 2 FT 3918011 https://fotmob.com/matches/espanyol-vs-celta-v...
2022-08-13 Real Valladolid-Villarreal None None 2022-08-13 17:00:00+00:00 Real Valladolid Villarreal 0 3 FT 3918016 https://fotmob.com/matches/villarreal-vs-real-...
2022-08-14 Almeria-Real Madrid None None 2022-08-14 20:00:00+00:00 Almeria Real Madrid 1 2 FT 3918014 https://fotmob.com/matches/real-madrid-vs-alme...

Read team match stats

[7]:
match_stats = fotmob.read_team_match_stats(opponent_stats=False, team='Valencia')
match_stats.head()
[7]:
Accurate passes Ball possession Big chances Big chances missed Corners Expected goals (xG) Fouls committed Shots on target Total shots Accurate passes (%)
league season game team
ESP-La Liga 2223 2022-08-14 Valencia-Girona Valencia 285 46 2 1 4 1.62 18 2 17 0.81
2022-08-21 Athletic Club-Valencia Valencia 432 63 1 1 3 0.73 15 0 10 0.85
2022-08-29 Valencia-Atletico Madrid Valencia 513 70 0 0 6 0.51 15 2 12 0.88
2022-09-04 Valencia-Getafe Valencia 482 70 4 1 9 2.30 11 9 24 0.88
2022-09-10 Rayo Vallecano-Valencia Valencia 266 54 3 2 7 1.80 11 4 17 0.77
This page was generated from doc/datasources/MatchHistory.ipynb.
[3]:
import soccerdata as sd

Match History

[4]:
mh = sd.MatchHistory(leagues="ENG-Premier League", seasons=2021)
print(mh.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg)
Provides pd.DataFrames from CSV files available at http://www.football-data.co.uk/data.php.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/MatchHistory``.

    Parameters
    ----------
    leagues : string or iterable
        IDs of leagues to include.
    seasons : string, int or list
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path, optional
        Path to directory where data will be cached.
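
The warning printed above shows that the integer 2021 is ambiguous and is read as the 20-21 season. One way to make the intent explicit is to build a season string in the '2016-17' style listed in the docstring. The helper below is a hypothetical convenience function, not part of soccerdata, and it is an untested assumption that this format is parsed without the warning.

```python
def explicit_season(start_year: int) -> str:
    """Build a 'YYYY-YY' season string from the year in which the season starts."""
    end_year = (start_year + 1) % 100  # two-digit year in which the season ends
    return f"{start_year}-{end_year:02d}"


print(explicit_season(2020))  # 2020-21

# Assumed usage (not executed here):
# mh = sd.MatchHistory(leagues="ENG-Premier League", seasons=explicit_season(2020))
```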

Historic match results and betting odds

[5]:
hist = mh.read_games()
hist.head()
[5]:
date home_team away_team FTHG FTAG FTR HTHG HTAG HTR referee HS AS HST AST HF AF HC AC HY AY HR AR B365H B365D B365A BWH BWD BWA IWH IWD IWA PSH PSD PSA WHH WHD WHA VCH VCD VCA MaxH MaxD MaxA AvgH AvgD AvgA B365>2.5 B365<2.5 P>2.5 P<2.5 Max>2.5 Max<2.5 Avg>2.5 Avg<2.5 AHh B365AHH B365AHA PAHH PAHA MaxAHH MaxAHA AvgAHH AvgAHA B365CH B365CD B365CA BWCH BWCD BWCA IWCH IWCD IWCA PSCH PSCD PSCA WHCH WHCD WHCA VCCH VCCD VCCA MaxCH MaxCD MaxCA AvgCH AvgCD AvgCA B365C>2.5 B365C<2.5 PC>2.5 PC<2.5 MaxC>2.5 MaxC<2.5 AvgC>2.5 AvgC<2.5 AHCh B365CAHH B365CAHA PCAHH PCAHA MaxCAHH MaxCAHA AvgCAHH AvgCAHA
league season game
ENG-Premier League 2021 2020-01-11 Aston Villa-Southampton 2020-01-11 12:00:00 Aston Villa Southampton 3 4 A 0 3 A D England 19 9 10 4 12 17 11 1 1 2 0 0 2.45 3.5 2.80 2.35 3.60 2.85 2.50 3.35 2.75 2.53 3.72 2.81 2.50 3.5 2.75 2.40 3.60 2.80 2.60 3.84 2.85 2.46 3.60 2.78 1.66 2.2 1.66 2.37 1.69 2.38 1.65 2.27 0.00 1.83 2.07 1.87 2.07 1.87 2.09 1.83 2.05 2.60 3.50 2.62 2.50 3.60 2.70 2.55 3.50 2.70 2.69 3.56 2.73 2.60 3.50 2.62 2.50 3.60 2.70 2.70 3.70 2.80 2.57 3.55 2.69 1.66 2.20 1.70 2.29 1.71 2.37 1.66 2.24 0.00 1.93 1.97 1.95 1.98 1.97 2.04 1.89 1.98
2020-01-11 Man United-Arsenal 2020-01-11 16:30:00 Man United Arsenal 0 1 A 0 0 D M Dean 8 7 2 2 12 12 6 3 3 3 0 0 2.00 3.6 3.60 2.00 3.70 3.60 2.05 3.45 3.55 2.05 3.78 3.73 2.05 3.6 3.60 2.05 3.60 3.50 2.10 3.85 3.76 2.04 3.66 3.58 1.72 2.1 1.76 2.19 1.81 2.22 1.74 2.12 -0.50 2.05 1.85 2.05 1.88 2.09 1.90 2.04 1.84 1.95 3.60 3.80 1.95 3.60 3.90 2.00 3.50 3.80 2.02 3.65 3.97 2.00 3.50 3.80 1.95 3.50 4.00 2.06 3.75 4.00 1.99 3.59 3.87 1.90 1.90 1.96 1.96 1.99 2.10 1.88 1.95 -0.50 2.00 1.90 2.03 1.90 2.05 1.92 2.00 1.88
2020-01-11 Newcastle-Everton 2020-01-11 14:00:00 Newcastle Everton 2 1 H 0 0 D S Attwell 11 15 4 4 9 10 5 4 2 4 0 0 3.70 3.4 2.05 3.75 3.40 2.05 3.55 3.30 2.10 3.82 3.56 2.09 3.70 3.4 2.05 3.50 3.50 2.05 4.00 3.59 2.15 3.71 3.47 2.06 1.80 2.0 1.86 2.07 1.88 2.08 1.82 2.01 0.25 2.05 1.75 2.15 1.80 2.17 1.83 2.12 1.77 3.30 3.30 2.25 3.30 3.30 2.25 3.15 3.35 2.30 3.34 3.43 2.33 3.40 3.25 2.25 3.30 3.30 2.25 3.62 3.50 2.34 3.34 3.33 2.27 2.00 1.80 2.12 1.81 2.14 1.92 2.04 1.80 0.25 1.95 1.95 1.93 2.00 1.99 2.03 1.93 1.95
2020-01-11 Tottenham-Brighton 2020-01-11 19:15:00 Tottenham Brighton 2 1 H 1 0 H G Scott 9 6 3 2 14 13 4 5 2 1 0 0 1.61 4.0 5.50 1.62 4.25 5.00 1.65 3.90 5.00 1.65 4.26 5.54 1.63 4.0 5.25 1.62 4.20 5.25 1.67 4.30 5.75 1.63 4.17 5.29 1.66 2.2 1.68 2.32 1.71 2.35 1.65 2.26 -1.00 2.02 1.77 2.15 1.80 2.16 1.84 2.09 1.80 1.55 4.33 6.00 1.57 4.25 5.50 1.60 4.20 5.75 1.61 4.36 5.77 1.57 4.00 6.00 1.57 4.20 5.75 1.63 4.45 6.39 1.59 4.29 5.62 1.66 2.20 1.68 2.32 1.72 2.37 1.65 2.27 -1.00 2.06 1.84 2.11 1.84 2.13 1.99 2.02 1.86
2020-02-11 Fulham-West Brom 2020-02-11 17:30:00 Fulham West Brom 2 0 H 2 0 H S Hooper 13 10 6 1 13 11 5 2 3 2 0 0 2.40 3.3 3.00 2.40 3.30 3.00 2.45 3.10 3.00 2.47 3.40 3.13 2.40 3.3 3.00 2.45 3.25 3.00 2.56 3.50 3.17 2.44 3.32 3.01 2.00 1.8 2.13 1.80 2.14 1.85 2.06 1.78 -0.25 2.10 1.80 2.11 1.84 2.15 1.86 2.10 1.80 2.40 3.30 3.00 2.45 3.30 2.95 2.40 3.05 3.10 2.48 3.32 3.19 2.40 3.20 3.10 2.40 3.25 3.10 2.48 3.45 3.22 2.41 3.27 3.09 2.05 1.85 2.05 1.88 2.11 1.92 2.02 1.81 -0.25 2.06 1.84 2.11 1.84 2.12 1.87 2.08 1.82
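
Betting odds are often easier to compare once they are converted to probabilities. The sketch below assumes that B365H, B365D and B365A are decimal odds for a home win, a draw and an away win, and uses the values from the first row above (2.45, 3.5, 2.80). Dividing by the sum of the inverse odds is one simple way to remove the bookmaker margin; other normalisation methods exist. The function name is hypothetical.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to one.

    Returns the normalised (home, draw, away) probabilities and the
    overround, i.e. the sum of the raw inverse odds.
    """
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)
    return [p / overround for p in raw], overround


# B365H, B365D, B365A for Aston Villa-Southampton, copied from the table above.
probs, overround = implied_probabilities(2.45, 3.5, 2.80)
print("home/draw/away:", [round(p, 3) for p in probs])
print("overround:", round(overround, 3))
```

An overround above 1 reflects the bookmaker's margin; the normalised values are only a rough estimate of the market's view.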
This page was generated from doc/datasources/SoFIFA.ipynb.
[3]:
import soccerdata as sd

SoFIFA

[4]:
sofifa = sd.SoFIFA(leagues="ENG-Premier League", versions="latest")
print(sofifa.__doc__)
Provides pd.DataFrames from data at http://sofifa.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/SoFIFA``.

    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of leagues to include.
    versions : string, int or list of int, optional
        FIFA releases to include. Should be specified by their ID used in the URL
        (e.g., 230034). Alternatively, the string "all" can be used to include all
        versions and "latest" to include the latest version only. Defaults to
        "latest".
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

EA Sports FIFA team ratings

[5]:
team_ratings = sofifa.read_team_ratings()
team_ratings.head()
[5]:
                                            overall  attack  midfield  defence  transfer_budget  players  fifa_edition  update
league              team
ENG-Premier League  AFC Bournemouth         74       75      73        73       €0               29       FIFA 23       May 26, 2023
                    Arsenal                 82       82      84        81       €0               32       FIFA 23       May 26, 2023
                    Aston Villa             79       79      77        79       €0               32       FIFA 23       May 26, 2023
                    Brentford               75       74      75        76       €0               31       FIFA 23       May 26, 2023
                    Brighton & Hove Albion  77       75      77        78       €0               28       FIFA 23       May 26, 2023

EA Sports FIFA player ratings

[6]:
player_ratings = sofifa.read_player_ratings(team="Arsenal")
player_ratings.head()
[6]:
fifa_edition update overallrating potential crossing finishing headingaccuracy shortpassing volleys dribbling curve fk_accuracy longpassing ballcontrol acceleration sprintspeed agility reactions balance shotpower jumping stamina strength longshots aggression interceptions positioning vision penalties composure defensiveawareness standingtackle slidingtackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
player
Aaron Ramsdale FIFA 23 May 26, 2023 83 86 15 14 15 56 16 18 17 16 57 36 47 50 53 82 49 65 66 35 59 15 35 24 18 65 25 65 20 16 15 84 79 87 79 86
Ainsley Maitland-Niles FIFA 23 May 26, 2023 73 75 67 58 54 74 44 71 64 50 66 73 79 77 79 70 69 67 66 69 67 62 65 70 69 67 69 70 66 74 72 15 9 7 10 13
Albert-Mboyo Sambi Lokonga FIFA 23 May 26, 2023 75 82 68 53 61 79 58 78 71 68 77 81 63 69 78 72 68 68 71 77 63 63 72 77 65 75 41 80 64 73 68 14 7 6 13 9
Amario Cozier-Duberry FIFA 23 May 26, 2023 59 77 58 57 38 55 60 60 53 41 44 58 72 69 73 50 68 56 44 55 45 45 55 27 59 56 55 55 28 28 24 11 12 9 13 7
Arthur Okonkwo FIFA 23 May 26, 2023 66 75 10 6 12 33 7 9 12 20 20 17 44 44 40 63 31 47 50 29 50 7 20 11 6 35 19 30 13 13 12 64 65 63 67 66
[7]:
sofifa.read_teams()
[7]:
team league fifa_edition update
team_id
10 Manchester City ENG-Premier League FIFA 23 May 26, 2023
9 Liverpool ENG-Premier League FIFA 23 May 26, 2023
1 Arsenal ENG-Premier League FIFA 23 May 26, 2023
5 Chelsea ENG-Premier League FIFA 23 May 26, 2023
11 Manchester United ENG-Premier League FIFA 23 May 26, 2023
18 Tottenham Hotspur ENG-Premier League FIFA 23 May 26, 2023
13 Newcastle United ENG-Premier League FIFA 23 May 26, 2023
2 Aston Villa ENG-Premier League FIFA 23 May 26, 2023
19 West Ham United ENG-Premier League FIFA 23 May 26, 2023
95 Leicester City ENG-Premier League FIFA 23 May 26, 2023
110 Wolverhampton Wanderers ENG-Premier League FIFA 23 May 26, 2023
14 Nottingham Forest ENG-Premier League FIFA 23 May 26, 2023
1808 Brighton & Hove Albion ENG-Premier League FIFA 23 May 26, 2023
7 Everton ENG-Premier League FIFA 23 May 26, 2023
1799 Crystal Palace ENG-Premier League FIFA 23 May 26, 2023
144 Fulham ENG-Premier League FIFA 23 May 26, 2023
8 Leeds United ENG-Premier League FIFA 23 May 26, 2023
17 Southampton ENG-Premier League FIFA 23 May 26, 2023
1925 Brentford ENG-Premier League FIFA 23 May 26, 2023
1943 AFC Bournemouth ENG-Premier League FIFA 23 May 26, 2023
[8]:
sofifa.read_players(team="Arsenal")
[8]:
player team league fifa_edition update
player_id
233934 Aaron Ramsdale Arsenal ENG-Premier League FIFA 23 May 26, 2023
231936 Benjamin White Arsenal ENG-Premier League FIFA 23 May 26, 2023
243715 William Saliba Arsenal ENG-Premier League FIFA 23 May 26, 2023
232580 Gabriel dos S. Magalhães Arsenal ENG-Premier League FIFA 23 May 26, 2023
227813 Oleksandr Zinchenko Arsenal ENG-Premier League FIFA 23 May 26, 2023
205498 Luiz Frello Filho Jorge Arsenal ENG-Premier League FIFA 23 May 26, 2023
222665 Martin Ødegaard Arsenal ENG-Premier League FIFA 23 May 26, 2023
199503 Granit Xhaka Arsenal ENG-Premier League FIFA 23 May 26, 2023
246669 Bukayo Saka Arsenal ENG-Premier League FIFA 23 May 26, 2023
230666 Gabriel Fernando de Jesus Arsenal ENG-Premier League FIFA 23 May 26, 2023
251566 Gabriel Teodoro Martinelli Silva Arsenal ENG-Premier League FIFA 23 May 26, 2023
207421 Leandro Trossard Arsenal ENG-Premier League FIFA 23 May 26, 2023
236988 Eddie Nketiah Arsenal ENG-Premier League FIFA 23 May 26, 2023
209989 Thomas Partey Arsenal ENG-Premier League FIFA 23 May 26, 2023
256958 Fábio Daniel Ferreira Vieira Arsenal ENG-Premier League FIFA 23 May 26, 2023
226491 Kieran Tierney Arsenal ENG-Premier League FIFA 23 May 26, 2023
265693 Jakub Kiwior Arsenal ENG-Premier League FIFA 23 May 26, 2023
233267 Matt Turner Arsenal ENG-Premier League FIFA 23 May 26, 2023
228295 Rob Holding Arsenal ENG-Premier League FIFA 23 May 26, 2023
240273 Emile Smith Rowe Arsenal ENG-Premier League FIFA 23 May 26, 2023
231448 Reiss Nelson Arsenal ENG-Premier League FIFA 23 May 26, 2023
211454 Mohamed Elneny Arsenal ENG-Premier League FIFA 23 May 26, 2023
232938 Takehiro Tomiyasu Arsenal ENG-Premier League FIFA 23 May 26, 2023
247731 Matthew Smith Arsenal ENG-Premier League FIFA 23 May 26, 2023
248732 Karl Hein Arsenal ENG-Premier League FIFA 23 May 26, 2023
259403 Joel Ideho Arsenal ENG-Premier League FIFA 23 May 26, 2023
260141 Cătălin Cîrjan Arsenal ENG-Premier League FIFA 23 May 26, 2023
259095 George Lewis Arsenal ENG-Premier League FIFA 23 May 26, 2023
271743 Amario Cozier-Duberry Arsenal ENG-Premier League FIFA 23 May 26, 2023
271808 Lino Sousa Arsenal ENG-Premier League FIFA 23 May 26, 2023
259097 Salah-Eddine Oulad M'Hand Arsenal ENG-Premier League FIFA 23 May 26, 2023
274387 Reuell Walters Arsenal ENG-Premier League FIFA 23 May 26, 2023
270390 Marcus Vincius Oliveira Alencar Arsenal ENG-Premier League FIFA 23 May 26, 2023
260653 Brooke Norton-Cuffy Arsenal ENG-Premier League FIFA 23 May 26, 2023
260144 Charlie Patino Arsenal ENG-Premier League FIFA 23 May 26, 2023
251530 Nuno Albertino Varela Tavares Arsenal ENG-Premier League FIFA 23 May 26, 2023
247463 Folarin Balogun Arsenal ENG-Premier League FIFA 23 May 26, 2023
246671 Arthur Okonkwo Arsenal ENG-Premier League FIFA 23 May 26, 2023
241928 Albert Sambi Lokonga Arsenal ENG-Premier League FIFA 23 May 26, 2023
235755 Auston Trusty Arsenal ENG-Premier League FIFA 23 May 26, 2023
226110 Nicolas Pépé Arsenal ENG-Premier League FIFA 23 May 26, 2023
225782 Ainsley Maitland-Niles Arsenal ENG-Premier League FIFA 23 May 26, 2023
222562 Rúnar Alex Rúnarsson Arsenal ENG-Premier League FIFA 23 May 26, 2023
206654 Pablo Marí Villar Arsenal ENG-Premier League FIFA 23 May 26, 2023
201118 Cédric Ricardo Alves Soares Arsenal ENG-Premier League FIFA 23 May 26, 2023
This page was generated from doc/datasources/Understat.ipynb.
[2]:
import soccerdata as sd

Understat

[3]:
understat = sd.Understat(leagues="ENG-Premier League", seasons="2015/2016")
print(understat.__doc__)
Provides pd.DataFrames from data at https://understat.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/Understat``.

    Parameters
    ----------
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.

Leagues

[4]:
leagues = understat.read_leagues()
leagues.head()
[4]:
                    league_id  url
league
ENG-Premier League  1          https://understat.com/league/EPL

Seasons

[5]:
seasons = understat.read_seasons()
seasons.head()
[5]:
                            league_id  season_id  url
league              season
ENG-Premier League  1516    1          2015       https://understat.com/league/EPL/2015

Schedule

[6]:
schedule = understat.read_schedule()
schedule.head()
[6]:
league_id season_id game_id date home_team_id away_team_id home_team away_team away_team_code home_team_code home_goals away_goals home_xg away_xg is_result has_data url
league season game
ENG-Premier League 1516 2015-08-08 Bournemouth-Aston Villa 1 2015 82 2015-08-08 18:00:00 73 71 Bournemouth Aston Villa AVL BOU 0 1 0.876106 0.782253 True True https://understat.com/match/82
2015-08-08 Chelsea-Swansea 1 2015 86 2015-08-08 20:30:00 80 84 Chelsea Swansea SWA CHE 2 2 0.64396 2.59203 True True https://understat.com/match/86
2015-08-08 Everton-Watford 1 2015 83 2015-08-08 18:00:00 72 90 Everton Watford WAT EVE 2 2 0.604226 0.557892 True True https://understat.com/match/83
2015-08-08 Leicester-Sunderland 1 2015 84 2015-08-08 18:00:00 75 77 Leicester Sunderland SUN LEI 4 2 2.56803 1.45946 True True https://understat.com/match/84
2015-08-08 Manchester United-Tottenham 1 2015 81 2015-08-08 15:45:00 89 82 Manchester United Tottenham TOT MUN 1 0 0.627539 0.6746 True True https://understat.com/match/81

Team match stats

[7]:
team_match_stats = understat.read_team_match_stats()
team_match_stats.head()
[7]:
league_id season_id game_id date home_team_id away_team_id home_team away_team away_team_code home_team_code ... away_ppda away_deep_completions home_points home_expected_points home_goals home_xg home_np_xg home_np_xg_difference home_ppda home_deep_completions
league season game
ENG-Premier League 1516 2015-08-08 Bournemouth-Aston Villa 1 2015 82 2015-08-08 18:00:00 73 71 Bournemouth Aston Villa AVL BOU ... 11.846154 2 0 1.3912 0 0.876106 0.876106 0.093853 6.9 11
2015-08-08 Chelsea-Swansea 1 2015 86 2015-08-08 20:30:00 80 84 Chelsea Swansea SWA CHE ... 8.833333 5 1 0.1836 2 0.64396 0.64396 -1.1869 10.363636 10
2015-08-08 Everton-Watford 1 2015 83 2015-08-08 18:00:00 72 90 Everton Watford WAT EVE ... 17.157895 4 1 1.3301 2 0.604226 0.604226 0.046334 6.65 5
2015-08-08 Leicester-Sunderland 1 2015 84 2015-08-08 18:00:00 75 77 Leicester Sunderland SUN LEI ... 9.555556 6 3 2.1323 4 2.56803 1.80686 0.3474 10.88 5
2015-08-08 Manchester United-Tottenham 1 2015 81 2015-08-08 15:45:00 89 82 Manchester United Tottenham TOT MUN ... 8.21875 10 3 1.2482 1 0.627539 0.627539 -0.047061 13.826087 4

5 rows × 26 columns

Player season stats

[8]:
player_season_stats = understat.read_player_season_stats()
player_season_stats.head()
[8]:
league_id season_id team team_id player_id position matches minutes goals xg np_goals np_xg assists xa shots key_passes yellow_cards red_cards xg_chain xg_buildup
league season player
ENG-Premier League 1516 Aaron Cresswell 1 2015 West Ham 81 534 D M 37 3319 2 1.092563 2 1.092563 4 3.294618 28 40 1 0 10.339367 6.831139
Aaron Lennon 1 2015 Everton 72 593 F M S 25 1534 5 2.226095 5 2.226095 0 1.697951 18 23 1 0 5.535392 1.90374
Aaron Ramsey 1 2015 Arsenal 83 504 M S 31 2624 5 8.614323 5 8.614323 4 4.046242 69 37 4 0 23.282566 14.176444
Abdul Rahman Baba 1 2015 Chelsea 80 684 D S 15 1018 0 0.379115 0 0.379115 1 0.179241 6 3 1 0 3.869669 3.490554
Adam Bogdan 1 2015 Liverpool 87 698 GK 2 180 0 0.0 0 0.0 0 0.0 0 0 0 0 0.120216 0.120216

Player match stats

[9]:
player_match_stats = understat.read_player_match_stats()
player_match_stats.head()
[9]:
league_id season_id game_id team_id player_id position position_id minutes goals own_goals shots xg xa xg_chain xg_buildup
league season game team player
ENG-Premier League 1516 2015-08-08 Bournemouth-Aston Villa Aston Villa Ashley Westwood 1 2015 82 71 669 MC 9 90 0 0 0 0.0 0.374082 0.131937 0.131937
Brad Guzan 1 2015 82 71 662 GK 1 90 0 0 0 0.0 0.0 0.0 0.0
Carlos Sánchez 1 2015 82 71 667 Sub 17 18 0 0 0 0.0 0.0 0.0 0.0
Ciaran Clark 1 2015 82 71 875 DC 3 90 0 0 0 0.0 0.0 0.131937 0.131937
Gabriel Agbonlahor 1 2015 82 71 890 FW 15 90 0 0 2 0.13016 0.113668 0.243828 0.0

Shot events

[10]:
shot_events = understat.read_shot_events()
shot_events.head()
[10]:
league_id season_id game_id date shot_id team_id player_id assist_player_id assist_player xg location_x location_y minute body_part situation result
league season game team player
ENG-Premier League 1516 2015-08-08 Bournemouth-Aston Villa Aston Villa Gabriel Agbonlahor 1 2015 82 2015-08-08 18:00:00 487146 71 890 <NA> <NA> 0.075484 0.86 0.503 67 Left Foot Open Play Blocked Shot
Gabriel Agbonlahor 1 2015 82 2015-08-08 18:00:00 487150 71 890 548651 Micah Richards 0.054676 0.795 0.507 81 Left Foot Open Play Saved Shot
Idrissa Gueye 1 2015 82 2015-08-08 18:00:00 487136 71 668 548653 Jordan Veretout 0.018269 0.742 0.598 26 Right Foot Open Play Missed Shot
Idrissa Gueye 1 2015 82 2015-08-08 18:00:00 487144 71 668 548657 Gabriel Agbonlahor 0.113668 0.883 0.561 47 Left Foot Open Play Saved Shot
Leandro Bacuna 1 2015 82 2015-08-08 18:00:00 487141 71 674 <NA> <NA> 0.097371 0.786 0.437 42 Right Foot Direct Freekick Blocked Shot
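
Shot events can be aggregated back into per-player totals. The sketch below sums the xg values of the five shots shown above, copied by hand into a list. Because head() only returns the first rows, these totals cover just the displayed shots and are not necessarily a player's full match total.

```python
from collections import defaultdict

# (player, xg) pairs copied from the five shot events shown above.
shots = [
    ("Gabriel Agbonlahor", 0.075484),
    ("Gabriel Agbonlahor", 0.054676),
    ("Idrissa Gueye", 0.018269),
    ("Idrissa Gueye", 0.113668),
    ("Leandro Bacuna", 0.097371),
]

# Accumulate expected goals per player.
xg_per_player = defaultdict(float)
for player, xg in shots:
    xg_per_player[player] += xg

for player, xg in sorted(xg_per_player.items()):
    print(f"{player}: {xg:.6f}")
```

For Gabriel Agbonlahor the two shots add up to 0.13016, which agrees with his xg value in the player match stats output earlier on this page.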
This page was generated from doc/datasources/WhoScored.ipynb.
[3]:
import soccerdata as sd

WhoScored

[4]:
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021)
print(ws.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:462: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg)
Provides pd.DataFrames from data available at http://whoscored.com.

    Data will be downloaded as necessary and cached locally in
    ``~/soccerdata/data/WhoScored``.

    Parameters
    ----------
    leagues : string or iterable, optional
        IDs of Leagues to include.
    seasons : string, int or list, optional
        Seasons to include. Supports multiple formats.
        Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
    proxy : 'tor' or dict or list(dict) or callable, optional
        Use a proxy to hide your IP address. Valid options are:
            - "tor": Uses the Tor network. Tor should be running in
              the background on port 9050.
            - dict: A dictionary with the proxy to use. The dict should be
              a mapping of supported protocols to proxy addresses. For example::

                  {
                      'http': 'http://10.10.1.10:3128',
                      'https': 'http://10.10.1.10:1080',
                  }

            - list(dict): A list of proxies to choose from. A different proxy will
              be selected from this list after failed requests, allowing rotating
              proxies.
            - callable: A function that returns a valid proxy. This function will
              be called after failed requests, allowing rotating proxies.
    no_cache : bool
        If True, will not use cached data.
    no_store : bool
        If True, will not store downloaded data.
    data_dir : Path
        Path to directory where data will be cached.
    path_to_browser : Path, optional
        Path to the Chrome executable.
    headless : bool, default: True
        If True, will run Chrome in headless mode. Setting this to False might
        help to avoid getting blocked.

Game schedule

[5]:
epl_schedule = ws.read_schedule()
epl_schedule.head()
[5]:
game_id home_team away_team date url stage
league season game
ENG-Premier League 2021 2020-09-12 Crystal Palace-Southampton 1485186 Crystal Palace Southampton 2020-09-12 15:00:00 https://www.whoscored.com/Matches/1485186/Live... NaN
2020-09-12 Fulham-Arsenal 1485187 Fulham Arsenal 2020-09-12 12:30:00 https://www.whoscored.com/Matches/1485187/Live... NaN
2020-09-12 Liverpool-Leeds United 1485188 Liverpool Leeds United 2020-09-12 17:30:00 https://www.whoscored.com/Matches/1485188/Live... NaN
2020-09-12 West Ham United-Newcastle United 1485191 West Ham United Newcastle United 2020-09-12 20:00:00 https://www.whoscored.com/Matches/1485191/Live... NaN
2020-09-13 Tottenham-Everton 1485189 Tottenham Everton 2020-09-13 16:30:00 https://www.whoscored.com/Matches/1485189/Live... NaN

Injured and suspended players

[6]:
missing_players = ws.read_missing_players(match_id=1485184)
missing_players.head()
[6]:
game_id player_id reason status
league season game team player
ENG-Premier League 2021 2021-01-12 Burnley-Manchester United Burnley Charlie Taylor 1485184 107462 injured doubtful Doubtful
Dwight McNeil 1485184 357427 injured doubtful Doubtful
Jay Rodriguez 1485184 33891 injured doubtful Doubtful
Jimmy Dunne 1485184 366743 injured doubtful Doubtful
Manchester United Eric Bailly 1485184 243814 injured doubtful Doubtful

Match event stream data

[7]:
events = ws.read_events(match_id=1485184)
events.head()
[7]:
period minute expanded_minute type outcome_type team player qualifiers x y end_x end_y goal_mouth_y goal_mouth_z is_touch is_shot is_goal related_event_id related_player_id blocked_x blocked_y card_type game_id team_id player_id
league season game id
ENG-Premier League 2021 2021-01-12 Burnley-Manchester United 2253458317 PreMatch 0 0 FormationSet Successful Burnley NaN [{'type': {'displayName': 'TeamPlayerFormation... 0.0 0.0 NaN NaN NaN NaN False NaN NaN NaN NaN NaN NaN NaN 1485184 184 NaN
2253458375 PreMatch 0 0 FormationSet Successful Man Utd NaN [{'type': {'displayName': 'CaptainPlayerId', '... 0.0 0.0 NaN NaN NaN NaN False NaN NaN NaN NaN NaN NaN NaN 1485184 32 NaN
2253487469 FirstHalf 0 0 Start Successful Burnley NaN [] 0.0 0.0 NaN NaN NaN NaN False NaN NaN NaN NaN NaN NaN NaN 1485184 184 NaN
2253487473 FirstHalf 0 0 Start Successful Man Utd NaN [] 0.0 0.0 NaN NaN NaN NaN False NaN NaN NaN NaN NaN NaN NaN 1485184 32 NaN
2253487625 FirstHalf 0 0 Pass Successful Burnley Ashley Westwood [{'type': {'displayName': 'Angle', 'value': 21... 50.3 50.3 30.5 50.3 NaN NaN True NaN NaN NaN NaN NaN NaN NaN 1485184 184 79050.0

Match event stream data can be returned in several formats, selected with the "output_fmt" parameter:

  • events (default): Returns a dataframe with all events.

  • raw: Returns the original unformatted WhoScored JSON.

  • spadl: Returns a dataframe with the SPADL representation of the original events.

  • atomic-spadl: Returns a dataframe with the Atomic-SPADL representation of the original events.

  • loader: Returns a socceraction.data.opta.OptaLoader instance.

[12]:
events = ws.read_events(match_id=1485184, output_fmt="raw")

import json
print(json.dumps(events[1485184][0], indent=2))
{
  "eventId": 2,
  "expandedMinute": 0,
  "id": 2253487473,
  "isTouch": false,
  "minute": 0,
  "outcomeType": {
    "displayName": "Successful",
    "value": 1
  },
  "period": {
    "displayName": "FirstHalf",
    "value": 1
  },
  "qualifiers": [],
  "satisfiedEventsTypes": [],
  "second": 0,
  "teamId": 32,
  "type": {
    "displayName": "Start",
    "value": 32
  },
  "x": 0,
  "y": 0
}
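
The raw format nests several fields (period, type, outcomeType) as small objects with a displayName and a value. The sketch below shows one way to flatten such an event into a single-level dict. It works on a copy of the JSON printed above, and the helper name flatten_event and the chosen output keys are illustrative, not the library's own conversion.

```python
import json

# The raw event printed above, pasted as a JSON string.
raw_event = json.loads("""
{
  "eventId": 2, "expandedMinute": 0, "id": 2253487473, "isTouch": false,
  "minute": 0,
  "outcomeType": {"displayName": "Successful", "value": 1},
  "period": {"displayName": "FirstHalf", "value": 1},
  "qualifiers": [], "satisfiedEventsTypes": [], "second": 0, "teamId": 32,
  "type": {"displayName": "Start", "value": 32},
  "x": 0, "y": 0
}
""")


def flatten_event(event):
    """Pick common fields and replace nested objects by their display names."""
    return {
        "id": event["id"],
        "team_id": event["teamId"],
        "minute": event["minute"],
        "second": event.get("second"),  # use .get in case the key is absent
        "period": event["period"]["displayName"],
        "type": event["type"]["displayName"],
        "outcome_type": event["outcomeType"]["displayName"],
        "is_touch": event["isTouch"],
        "x": event["x"],
        "y": event["y"],
    }


flat = flatten_event(raw_event)
print(flat)
```

The flattened values line up with the row for event 2253487473 in the default events output shown earlier (FirstHalf, Start, Successful, team_id 32).
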
[13]:
actions = ws.read_events(match_id=1485184, output_fmt='spadl')
actions.head()
[13]:
game_id original_event_id period_id time_seconds team_id player_id start_x end_x start_y end_y type_id result_id bodypart_id action_id player team
0 1485184 2253487625 1 0.0 184 79050.0 52.815 32.025 34.204 34.204 0 1 0 0 Ashley Westwood Burnley
1 1485184 2253487639 1 2.0 184 131464.0 31.080 38.220 36.312 15.844 0 1 0 1 James Tarkowski Burnley
2 1485184 NaN 1 4.5 184 80067.0 38.220 43.365 15.844 12.512 21 1 0 2 Matthew Lowton Burnley
3 1485184 2253487685 1 7.0 184 80067.0 43.365 90.300 12.512 49.708 0 1 0 3 Matthew Lowton Burnley
4 1485184 2253487689 1 11.0 184 93473.0 90.300 105.000 49.708 38.828 11 0 0 4 Robbie Brady Burnley
[14]:
atomic_actions = ws.read_events(match_id=1485184, output_fmt='atomic-spadl')
atomic_actions.head()
[14]:
game_id original_event_id action_id period_id time_seconds team_id player_id x y dx dy type_id bodypart_id player team
0 1485184 2253487625 0 1 0.00 184 79050.0 52.815 34.204 -20.790 0.000 0 0 Ashley Westwood Burnley
1 1485184 2253487625 1 1 1.00 184 131464.0 32.025 34.204 0.000 0.000 23 0 James Tarkowski Burnley
2 1485184 2253487639 2 1 2.00 184 131464.0 31.080 36.312 7.140 -20.468 0 0 James Tarkowski Burnley
3 1485184 2253487639 3 1 3.25 184 80067.0 38.220 15.844 0.000 0.000 23 0 Matthew Lowton Burnley
4 1485184 NaN 4 1 4.50 184 80067.0 38.220 15.844 5.145 -3.332 21 0 Matthew Lowton Burnley
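
Comparing the two outputs above suggests how the formats relate: SPADL stores a start and an end location per action, while Atomic-SPADL stores a location plus a displacement (dx, dy). The sketch below checks, for the two passes that appear in both outputs, that dx and dy equal end minus start. The values are copied by hand from the tables, and the relation is only verified for these rows.

```python
# SPADL rows from the output above, keyed by original_event_id.
spadl_rows = {
    2253487625: {"start_x": 52.815, "end_x": 32.025, "start_y": 34.204, "end_y": 34.204},
    2253487639: {"start_x": 31.080, "end_x": 38.220, "start_y": 36.312, "end_y": 15.844},
}

# Matching Atomic-SPADL rows (the first atomic action of each original event).
atomic_rows = {
    2253487625: {"dx": -20.790, "dy": 0.000},
    2253487639: {"dx": 7.140, "dy": -20.468},
}

# Absolute difference between (end - start) and the atomic displacement.
differences = {}
for event_id, action in spadl_rows.items():
    dx = action["end_x"] - action["start_x"]
    dy = action["end_y"] - action["start_y"]
    differences[event_id] = (
        abs(dx - atomic_rows[event_id]["dx"]),
        abs(dy - atomic_rows[event_id]["dy"]),
    )
    print(event_id, round(dx, 3), round(dy, 3))
```

The extra Atomic-SPADL rows with zero displacement (type_id 23 in the output above) have no counterpart in the SPADL table and are left out of this comparison.
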
[15]:
# Scrape all games and return a socceraction.data.opta.OptaLoader
loader = ws.read_events(output_fmt='loader')

# Now use this loader to load the data
print("Games:")
df_games = loader.games(competition_id="ENG-Premier League", season_id="2021")
display(df_games.head())

print("Teams:")
df_teams = loader.teams(game_id=1485184)
display(df_teams.head())

print("Players:")
df_players = loader.players(game_id=1485184)
display(df_players.head())

print("Events:")
df_events = loader.events(game_id=1485184)
display(df_events.head())

# You can use the socceraction package to convert the events
# to SPADL and to compute xT or VAEP action values
Games:
game_id season_id competition_id game_day game_date home_team_id away_team_id home_score away_score duration referee venue attendance home_manager away_manager
0 1485494 2021 ENG-Premier League None 2021-04-04 12:00:00 18 184 3 2 98 Andre Marriner St. Mary's Stadium 0 Ralph Hasenhüttl Sean Dyche
1 1485300 2021 ENG-Premier League None 2020-12-16 20:00:00 170 211 0 0 95 Robert Jones Craven Cottage 0 Scott Parker Graham Potter
2 1485264 2021 ENG-Premier League None 2020-12-06 19:15:00 26 161 4 0 97 Craig Pawson Anfield 2000 Jürgen Klopp Nuno Espírito Santo
3 1485519 2021 ENG-Premier League None 2021-05-16 16:30:00 175 26 1 2 102 Mike Dean The Hawthorns 0 Sam Allardyce Jürgen Klopp
4 1485436 2021 ENG-Premier League None 2021-03-19 20:00:00 170 19 1 2 100 David Coote Craven Cottage 0 Scott Parker Marcelo Bielsa
Teams:
team_id team_name
0 184 Burnley
1 32 Man Utd
Players:
game_id team_id player_id player_name is_starter minutes_played jersey_number starting_position
0 1485184 184 105720 Nick Pope True 102 1 GK
1 1485184 184 80067 Matthew Lowton True 102 2 DR
2 1485184 184 94935 Ben Mee True 102 6 DC
3 1485184 184 131464 James Tarkowski True 102 5 DC
4 1485184 184 24148 Erik Pieters True 102 23 DL
Events:
game_id event_id period_id team_id player_id type_id timestamp minute second outcome start_x start_y end_x end_y qualifiers related_player_id touch shot goal type_name
0 1485184 2253487473 1 32 NaN 32 2021-01-12 20:15:00 0 0 True 0.0 0.0 0.0 0.0 {} NaN False False False start
1 1485184 2253487469 1 184 NaN 32 2021-01-12 20:15:00 0 0 True 0.0 0.0 0.0 0.0 {} NaN False False False start
2 1485184 2253487625 1 184 79050.0 1 2021-01-12 20:15:00 0 0 True 50.3 50.3 30.5 50.3 {213: '3.1', 178: True, 141: '50.3', 212: '20.... NaN True False False pass
3 1485184 2253487639 1 184 131464.0 1 2021-01-12 20:15:02 0 2 True 29.6 53.4 36.4 23.3 {178: True, 213: '5.0', 212: '21.7', 141: '23.... NaN True False False pass
4 1485184 2253487685 1 184 80067.0 1 2021-01-12 20:15:07 0 7 True 41.3 18.4 86.0 73.1 {1: True, 213: '0.7', 56: 'Center', 178: True,... NaN True False False pass

How-to Guides

Here you’ll find short answers to “How do I…?” types of questions. These how-to guides don’t cover topics in depth – you’ll find that material in the API Reference. However, these guides will help you quickly accomplish common tasks.

How to add custom leagues

SoccerData has built-in support to scrape data from the top-5 European leagues and the major international tournaments. The leagues available for each source can be listed with the available_leagues() class method.

>>> import soccerdata as sd
>>> sd.FBref.available_leagues()
['ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga', 'ITA-Serie A']

This documentation explains how to add custom leagues.

Warning

Note that you might encounter errors when trying to scrape data for the leagues you added yourself. This is because the data provided for these leagues might have a different structure. If you encounter such an error, please do not open an issue on GitHub, but try to fix it yourself.

Adding a new league

Additional leagues can be configured in SOCCERDATA_DIR/config/league_dict.json. This file should contain a mapping between a generic name for the league and the identifier used internally by each data source (see below) that you want to support. For example, for the Dutch Eredivisie this would be:

{
  "NED-Eredivisie": {
    "ClubElo": "NED_1",
    "MatchHistory": "N1",
    "SoFIFA": "[Netherlands] Eredivisie",
    "FBref": "Eredivisie",
    "ESPN": "ned.1",
    "FiveThirtyEight": "eredivisie",
    "WhoScored": "Netherlands - Eredivisie",
    "Sofascore": "Eredivisie",
    "season_start": "Aug",
    "season_end": "May"
  }
}

The season_start and season_end fields are optional. They should give the months in which the first and last games of a season are played, respectively. If they are not provided, July is used as the first month of the season and June as the last.

Now, restart your Python session and check whether the new league appears in the list of available leagues by running the command below.

>>> import soccerdata as sd
>>> sd.FBref.available_leagues()
[..., 'NED-Eredivisie', ...]
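If you prefer to edit the configuration from a script rather than by hand, the entry can be merged into the file with the standard library. This is only a sketch: add_league is a hypothetical helper, not part of soccerdata, and the demonstration writes to a temporary directory instead of your real SOCCERDATA_DIR.

```python
import json
import tempfile
from pathlib import Path


def add_league(config_file: Path, league_id: str, source_ids: dict) -> dict:
    """Merge one league entry into a league_dict.json file (hypothetical helper)."""
    leagues = {}
    if config_file.exists():
        # Keep the leagues that were configured earlier.
        leagues = json.loads(config_file.read_text(encoding="utf-8"))
    leagues[league_id] = source_ids
    config_file.parent.mkdir(parents=True, exist_ok=True)
    config_file.write_text(json.dumps(leagues, indent=2), encoding="utf-8")
    return leagues


# Demonstration in a throwaway directory. In practice config_file would be
# SOCCERDATA_DIR/config/league_dict.json.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config" / "league_dict.json"
    leagues = add_league(
        config_file,
        "NED-Eredivisie",
        {
            "ClubElo": "NED_1",
            "FBref": "Eredivisie",
            "season_start": "Aug",
            "season_end": "May",
        },
    )
    saved = json.loads(config_file.read_text(encoding="utf-8"))

print(sorted(saved))
```

Remember that the library reads the file at import time, so a restart is still needed after running such a script.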

Internal identifiers

Below are instructions on how to find the internal identifiers for each data source.

ClubElo

The internal identifier has the format {country_code}_{level}. To get the country code, go to https://clubelo.com/, click on the league you want to add and take the three-letter code in the URL. For example, the URL for the Dutch Eredivisie is http://clubelo.com/NED which means that the country identifier is NED. The level is the number of the league, starting with 1 for the top league. The internal identifier for the Dutch Eredivisie is therefore NED_1.
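The rule above is mechanical enough to script. The sketch below derives the identifier from a league URL. The function name clubelo_league_id is an assumption made for illustration, and the check on the country code is a simple sanity test rather than anything clubelo.com guarantees.

```python
from urllib.parse import urlparse


def clubelo_league_id(league_url: str, level: int = 1) -> str:
    """Build a '{country_code}_{level}' identifier from a clubelo.com league URL."""
    # The country code is the first path segment, e.g. '/NED' -> 'NED'.
    country_code = urlparse(league_url).path.strip("/").split("/")[0]
    if len(country_code) != 3 or not country_code.isalpha():
        raise ValueError(f"No three-letter country code found in {league_url!r}")
    if level < 1:
        raise ValueError("level starts at 1 for the top league")
    return f"{country_code.upper()}_{level}"


eredivisie_id = clubelo_league_id("http://clubelo.com/NED", level=1)
print(eredivisie_id)  # NED_1
```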

MatchHistory

The internal identifier has the format {country_code}{level}. Download the CSV file corresponding to the league you would like to add from https://www.football-data.co.uk/data.php and take the value in the Div column.

SoFIFA

The internal identifier has the format [{region}] {league name}. Go to https://sofifa.com/api/league to get the list of available leagues. The {region} corresponds to the nationName field in the JSON response. The {league name} corresponds to the value field.
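As a rough illustration, the identifier can be assembled from the two fields named above. The overall shape of the response (a list of objects) is an assumption here, and sample_response is a hand-written stand-in rather than real API output.

```python
def sofifa_league_id(entry: dict) -> str:
    """Format one league entry as '[{region}] {league name}'."""
    return f"[{entry['nationName']}] {entry['value']}"


# Hypothetical excerpt of the JSON response; only the two fields named above are used.
sample_response = [
    {"nationName": "Netherlands", "value": "Eredivisie"},
]

sofifa_ids = [sofifa_league_id(entry) for entry in sample_response]
print(sofifa_ids)
```

The result for the Eredivisie matches the "SoFIFA" value used in the league_dict.json example earlier on this page.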

FBref

Go to https://fbref.com/en/comps/ and take the value in the Competition Name column.

ESPN

The internal identifier has the format {country_code}.{level}. Go to https://www.espn.com/soccer/competitions, click on the league you want to add and take the value in the URL after /league/_/name/.

FiveThirtyEight

Go to https://projects.fivethirtyeight.com/soccer-predictions/, select the relevant league and take the value in the URL after /soccer-predictions/.
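For both ESPN and FiveThirtyEight the identifier is the path segment that follows a fixed marker in the URL, so one small helper covers both cases. This is a sketch under that assumption. The helper name id_after is hypothetical and the example URLs are illustrative, built from the markers and Eredivisie identifiers given on this page.

```python
from urllib.parse import urlparse


def id_after(url: str, marker: str) -> str:
    """Return the first path segment that follows `marker` in `url`."""
    path = urlparse(url).path
    if marker not in path:
        raise ValueError(f"{marker!r} not found in {url!r}")
    # Keep only the segment directly after the marker.
    return path.split(marker, 1)[1].strip("/").split("/")[0]


espn_id = id_after(
    "https://www.espn.com/soccer/league/_/name/ned.1",
    "/league/_/name/",
)
five38_id = id_after(
    "https://projects.fivethirtyeight.com/soccer-predictions/eredivisie/",
    "/soccer-predictions/",
)
print(espn_id, five38_id)
```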

WhoScored

Go to https://www.whoscored.com and use the JavaScript console to get the value of the allRegions variable. The internal identifier has the format {region name} - {league name}.

FotMob

The internal identifier is identical in style to the general format: [{region}]-{league name}. Go to https://www.fotmob.com/api/allLeagues to get the list of available leagues. The {region} corresponds to the ccode field in the JSON response. The {league name} corresponds to the name field.

Sofascore

Go to https://api.sofascore.com/api/v1/config/unique-tournaments/EN/football to get the list of major leagues and tournaments. The leagues are listed under uniqueTournaments in the JSON response; the internal identifier is the value of a league’s name field.

Troubleshooting

If you add a new league and it doesn’t show up in the list of available leagues, there are a few things you can do to debug the problem.

  1. Make sure to reload the soccerdata module after you modify the league_dict.json file. The most straightforward way to do this is to restart your notebook or Python interpreter.

  2. Check whether your league_dict.json file is at the correct location. If so, you should see this appear in the log messages when importing the soccerdata library.

    >>> import soccerdata as sd
    [11/25/22 11:49:12] INFO     Custom team name replacements loaded from <path>/teamname_replacements.json.                                                                                                _config.py:83
                        INFO     Custom league dict loaded from <path>/league_dict.json.                                                                                                                    _config.py:153
    
  3. Check whether the content of your league_dict.json file is valid JSON. You can check the file’s syntax using Python’s built-in json.tool module.

    $ cat config/league_dict.json | python -m json.tool
    Expecting ',' delimiter: line 1 column 10 (char 9)
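The same syntax check can be run from Python, which is handy inside a notebook. The sketch below uses only the standard json module. The function name check_league_dict is an assumption, and the requirement that the top-level value is an object follows from the example file shown earlier rather than from the library's source.

```python
import json


def check_league_dict(text: str) -> str:
    """Return 'ok' or a short description of the first problem found."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"
    if not isinstance(data, dict):
        return "top-level value must be an object mapping league IDs to settings"
    return "ok"


good = check_league_dict('{"NED-Eredivisie": {"FBref": "Eredivisie"}}')
# The colon after the league name is missing on purpose.
bad = check_league_dict('{"NED-Eredivisie" {"FBref": "Eredivisie"}}')
print(good)
print(bad)
```

To check your own file, pass the result of Path(...).read_text() for your league_dict.json to the function.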
    

How to use a proxy server

You can set up a SOCKS5 proxy with Tor. Check out the installation guide on the Tor website for installation instructions. After installing Tor, make sure to start it before scraping by running the tor command from your terminal (in a separate window). Tor will start up and listen on “localhost:9050” by default. Once Tor is running, you can enable the proxy by setting proxy='tor'.

ws = sd.WhoScored(proxy='tor')

The code snippet above assumes you have a Tor proxy running on “localhost:9050”. Many distributions indeed default to having a SOCKS proxy listening on port 9050, but some may not. In particular, the Tor Browser Bundle defaults to listening on port 9150. You can specify a custom host and port as follows:

ws = sd.WhoScored(proxy={
     "http": "socks5://127.0.0.1:9150",
     "https": "socks5://127.0.0.1:9150",
 })
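If you switch between several Tor setups, it can help to build this mapping with a small function. The following is a minimal sketch. socks5_proxies is a hypothetical helper, and the commented-out last line assumes a Tor Browser is running locally.

```python
def socks5_proxies(host: str = "127.0.0.1", port: int = 9050) -> dict:
    """Build the protocol-to-address mapping accepted by the `proxy` argument."""
    if not 0 < port < 65536:
        raise ValueError(f"Invalid port: {port}")
    address = f"socks5://{host}:{port}"
    # The same SOCKS5 endpoint handles both plain and TLS traffic.
    return {"http": address, "https": address}


tor_browser_proxy = socks5_proxies(port=9150)
print(tor_browser_proxy)

# ws = sd.WhoScored(proxy=tor_browser_proxy)  # requires a running Tor Browser
```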

Data Analysis Examples

Below is a gallery of examples of data analysis using the soccerdata package. If you have an example you would like to share, please submit a pull request to the SoccerData GitHub repository.

This page was generated from doc/examples/ClubElo - Evolution of current top teams.ipynb.
[2]:
import soccerdata as sd
[3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_context("notebook")
sns.set_style("whitegrid")

Evolution of top teams’ Elo ratings

How did the current top 5 teams in the world develop over time?

[4]:
elo = sd.ClubElo()
current_elo = elo.read_by_date()
current_elo.head()
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/clubelo.py:18: FutureWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  return pd.read_csv(
[4]:
rank country level elo from to league
team
Man City 1.0 ENG 1 2079.984619 2023-06-11 2023-08-11 ENG-Premier League
Liverpool 2.0 ENG 1 1949.932861 2023-06-11 2023-08-13 ENG-Premier League
Bayern 3.0 GER 1 1937.012451 2023-05-28 2023-08-18 GER-Bundesliga
Arsenal 4.0 ENG 1 1928.216187 2023-06-11 2023-08-12 ENG-Premier League
Real Madrid 5.0 ESP 1 1907.589233 2023-06-08 2023-08-12 ESP-La Liga
[5]:
num_teams = 5
smoothing = 100
elo_top_development = pd.concat(
    [elo.read_team_history(team)['elo'].rolling(smoothing).mean()
     for team in current_elo.reset_index()['team'][:num_teams]
    ],
    axis=1)
elo_top_development.columns = current_elo.reset_index()['team'][:num_teams]
elo_top_development = elo_top_development.ffill()

fig = plt.figure(figsize=(16, 10))
ax1 = fig.add_subplot(111, ylabel='ELO rolling avg.', xlabel='Date')
elo_top_development.plot(ax=ax1)
ax1.legend(loc='upper left', frameon=False, bbox_to_anchor=(0, 1.05), ncol=num_teams)
sns.despine();
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/clubelo.py:18: FutureWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  return pd.read_csv(
_images/examples_ClubElo_-_Evolution_of_current_top_teams_5_1.png
This page was generated from doc/examples/MatchHistory - Home advantage.ipynb.
[2]:
import soccerdata as sd
[3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_context("notebook")
sns.set_style("whitegrid")

Home team advantage in the Italian Serie A

We all know sports teams have an advantage when playing at home. Here’s a look at home team advantage for 5 years of the Serie A.

[4]:
seriea_hist = sd.MatchHistory('ITA-Serie A', range(2018, 2023))
games = seriea_hist.read_games()
games.sample(5)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:471: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
  warnings.warn(msg, stacklevel=1)
[4]:
date home_team away_team FTHG FTAG FTR HTHG HTAG HTR HS ... AvgC<2.5 AHCh B365CAHH B365CAHA PCAHH PCAHA MaxCAHH MaxCAHA AvgCAHH AvgCAHA
league season game
ITA-Serie A 1819 2018-12-22 Lazio-Cagliari 2018-12-22 12:00:00 Lazio Cagliari 3 1 H 2 0 H 15 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021 2020-12-20 Torino-Bologna 2020-12-20 11:30:00 Torino Bologna 1 1 D 0 0 D 13 ... 2.25 0.0 1.91 2.02 1.92 2.0 1.94 2.11 1.86 2.00
1819 2019-03-03 Udinese-Bologna 2019-03-03 12:00:00 Udinese Bologna 2 1 H 1 1 D 11 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2021 2021-02-20 Genoa-Verona 2021-02-20 17:00:00 Genoa Verona 2 2 D 0 1 A 12 ... 1.64 0.0 1.95 1.98 1.93 2.0 2.07 2.00 1.93 1.93
1819 2019-03-10 Fiorentina-Lazio 2019-03-10 12:00:00 Fiorentina Lazio 1 1 D 0 1 A 7 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

[5]:
def home_away_results(games: pd.DataFrame):
    """Returns aggregated home/away results per team"""

    res = (
        pd.melt(games.reset_index(),
                id_vars=['date', 'FTR'],
                value_name='team', var_name='is_home',
                value_vars=['home_team', 'away_team']))

    res.is_home = res.is_home.replace(['home_team', 'away_team'], ['Home', 'Away'])
    res['win'] = res['lose'] = res['draw'] = 0
    res.loc[(res['is_home'] == 'Home') & (res['FTR'] == 'H'), 'win'] = 1
    res.loc[(res['is_home'] == 'Away') & (res['FTR'] == 'A'), 'win'] = 1
    res.loc[(res['is_home'] == 'Home') & (res['FTR'] == 'A'), 'lose'] = 1
    res.loc[(res['is_home'] == 'Away') & (res['FTR'] == 'H'), 'lose'] = 1
    res.loc[res['FTR'] == 'D', 'draw'] = 1

    g = res.groupby(['team', 'is_home'])
    w = (g.win.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_win', 'mean': 'win_pct'}))
    l = (g.lose.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_lose', 'mean': 'lose_pct'}))
    d = (g.draw.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_draw', 'mean': 'draw_pct'}))

    res = pd.concat([w, l, d], axis=1)
    return res
[6]:
results = home_away_results(games)
results.head(6)
[6]:
n_win win_pct n_lose lose_pct n_draw draw_pct
team is_home
Atalanta Away 52 0.547368 16 0.168421 27 0.284211
Home 56 0.589474 23 0.242105 16 0.168421
Benevento Away 10 0.263158 18 0.473684 10 0.263158
Home 4 0.105263 20 0.526316 14 0.368421
Bologna Away 22 0.231579 48 0.505263 25 0.263158
Home 35 0.368421 29 0.305263 31 0.326316

The overall picture shows most teams have a clear advantage at home:

[7]:
g = sns.FacetGrid(results.reset_index(), hue='team', palette='Set2', height=6, aspect=.5)
g.map(sns.pointplot, 'is_home', 'win_pct', order=["Away", "Home"])
g.set_axis_labels('', 'win %');
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/.venv/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
_images/examples_MatchHistory_-_Home_advantage_8_1.png

But there are a few exceptions.

[8]:
g = sns.FacetGrid(results.reset_index(), col='team', col_wrap=5)
g.map(sns.pointplot, 'is_home', 'win_pct', order=["Away", "Home"])
g.set_axis_labels('', 'win %');
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/.venv/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
_images/examples_MatchHistory_-_Home_advantage_10_1.png

API Reference

This part of the documentation covers all the interfaces of the implemented data scrapers.

Club Elo

class soccerdata.ClubElo(proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/ClubElo'))

Provides pd.DataFrames from CSV API at http://api.clubelo.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/ClubElo.

Since the source does not provide league names, this class will not filter by league. League names will be inserted from the other sources where available. Leagues that are only covered by clubelo.com will have NaN values.

Parameters:
  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • “tor”: Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.
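The callable option of the proxy parameter can be used to rotate through a fixed pool of proxies. The sketch below is one possible way to build such a callable. It assumes the function is called without arguments, and both make_proxy_rotator and the second proxy address are made up for illustration.

```python
from itertools import cycle


def make_proxy_rotator(proxies: list):
    """Return a zero-argument callable that yields the next proxy on each call."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)

    def next_proxy() -> dict:
        return next(pool)

    return next_proxy


# Placeholder addresses; replace them with proxies you control.
rotator = make_proxy_rotator([
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
])

first, second, third = rotator(), rotator(), rotator()
print(first["http"], second["http"], third["http"])

# elo = sd.ClubElo(proxy=rotator)  # the scraper would call rotator() after a failed request
```

With two proxies in the pool, the third call wraps around to the first one again.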

read_by_date(date=None)

Retrieve ELO scores for all teams at the specified date.

Elo scores are available as early as 1939. Values before 1960 should be considered provisional.

Parameters:

date (datetime object or string like 'YYYY-MM-DD') – Date for which to retrieve ELO scores. If no date is specified, get today’s scores.

Raises:
  • TypeError – If date is not a date string or datetime object.

  • ValueError – If date is an invalid date string.

Return type:

pd.DataFrame
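To make the accepted inputs and the two error cases of read_by_date() concrete, here is a sketch of how such an argument could be validated. It is not the library's implementation. normalize_date is a hypothetical helper that only mirrors the behaviour described above.

```python
from datetime import date, datetime


def normalize_date(value=None) -> str:
    """Return `value` as a 'YYYY-MM-DD' string, mirroring the documented errors."""
    if value is None:
        # No date given: fall back to today.
        return date.today().isoformat()
    if isinstance(value, datetime):
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, str):
        # strptime raises ValueError for strings that are not valid dates.
        return datetime.strptime(value, "%Y-%m-%d").date().isoformat()
    raise TypeError("date must be a datetime object or a 'YYYY-MM-DD' string")


from_string = normalize_date("2023-08-11")
from_datetime = normalize_date(datetime(2023, 8, 11, 12, 0))

rejected = False
try:
    normalize_date("2023-02-30")  # 30 February does not exist
except ValueError:
    rejected = True

print(from_string, from_datetime, rejected)
```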

read_team_history(team, max_age=1)

Retrieve full ELO history for one club.

For the exact spelling of a club’s name, check the result of read_by_date() or clubelo.com. You can also use alternative team names specified in teamname_replacements.json. Values before 1960 should be considered provisional.

Parameters:
  • team (str) – The club’s name.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

Raises:
  • TypeError – If max_age is not an integer or timedelta object.

  • ValueError – If no ratings for the given team are available.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase
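The interaction between filepath, max_age and no_cache in get() can be pictured with a small freshness check. The code below is a sketch derived from the parameter descriptions above, not the library's actual logic. is_cache_fresh is a hypothetical name, and treating a missing max_age as "no age limit" is an assumption.

```python
import os
import tempfile
import time
from datetime import timedelta
from pathlib import Path
from typing import Optional, Union


def is_cache_fresh(
    filepath: Optional[Path],
    max_age: Optional[Union[int, timedelta]] = None,
    no_cache: bool = False,
) -> bool:
    """Decide whether a cached file can be reused instead of downloading again."""
    if no_cache or filepath is None or not filepath.exists():
        return False
    if max_age is None:
        return True  # assumption: without a limit, any existing file is reused
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(max_age, bool) or not isinstance(max_age, (int, timedelta)):
        raise TypeError("max_age must be an int (days) or a timedelta")
    limit = max_age if isinstance(max_age, timedelta) else timedelta(days=max_age)
    age = timedelta(seconds=time.time() - filepath.stat().st_mtime)
    return age <= limit


# Demonstration with a temporary file whose modification time is set to three days ago.
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "ratings.csv"
    cached.write_text("team,elo\n", encoding="utf-8")
    three_days_ago = time.time() - 3 * 24 * 3600
    os.utime(cached, (three_days_ago, three_days_ago))
    checks = {
        "no limit": is_cache_fresh(cached),
        "1 day": is_cache_fresh(cached, max_age=1),
        "1 week": is_cache_fresh(cached, max_age=timedelta(weeks=1)),
        "no_cache": is_cache_fresh(cached, no_cache=True),
    }

print(checks)
```

A file that is three days old passes a one-week limit but fails a one-day limit, and no_cache=True always forces a fresh download.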

property leagues: List[str]

Return a list of selected leagues.

property seasons: List[str]

Return a list of selected seasons.

ESPN

class soccerdata.ESPN(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/ESPN'))

Provides pd.DataFrames from JSON API available at http://site.api.espn.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/ESPN.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • “tor”: Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

property seasons: List[str]

Return a list of selected seasons.

read_schedule(force_cache=False)

Retrieve the game schedule for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_matchsheet(match_id=None)

Retrieve match sheets for the selected leagues and seasons.

Parameters:

match_id (int or list of int, optional) – Retrieve the match sheet for a specific game.

Raises:

ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame

read_lineup(match_id=None)

Retrieve lineups for the selected leagues and seasons.

Parameters:

match_id (int or list of int, optional) – Retrieve the lineup for a specific game.

Raises:

ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase

property leagues: List[str]

Return a list of selected leagues.

FBref

class soccerdata.FBref(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FBref'))

Provides pd.DataFrames from data at http://fbref.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FBref.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include. For efficiently reading data from the Top-5 European leagues, use “Big 5 European Leagues Combined”.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • “tor”: Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

read_leagues(split_up_big5=False)

Retrieve selected leagues from the datasource.

Parameters:

split_up_big5 (bool) – If True, it will load the “Big 5 European Leagues Combined” instead of each league individually.

Return type:

pd.DataFrame

read_seasons(split_up_big5=False)

Retrieve the selected seasons for the selected leagues.

Parameters:

split_up_big5 (bool) – If True, it will load the “Big 5 European Leagues Combined” instead of each league individually.

Return type:

pd.DataFrame

read_team_season_stats(stat_type='standard', opponent_stats=False)

Retrieve aggregated season stats for all teams in the selected leagues and seasons.

The following stat types are available:
  • ‘standard’

  • ‘keeper’

  • ‘keeper_adv’

  • ‘shooting’

  • ‘passing’

  • ‘passing_types’

  • ‘goal_shot_creation’

  • ‘defense’

  • ‘possession’

  • ‘playing_time’

  • ‘misc’

Parameters:
  • stat_type (str) – Type of stats to retrieve.

  • opponent_stats (bool) – If True, will retrieve opponent stats.

Raises:

ValueError – If stat_type is not valid.

Return type:

pd.DataFrame

read_team_match_stats(stat_type='schedule', opponent_stats=False, team=None, force_cache=False)

Retrieve the match logs for all teams in the selected leagues and seasons.

The following stat types are available:
  • ‘schedule’

  • ‘keeper’

  • ‘shooting’

  • ‘passing’

  • ‘passing_types’

  • ‘goal_shot_creation’

  • ‘defense’

  • ‘possession’

  • ‘misc’

Parameters:
  • stat_type (str) – Type of stats to retrieve.

  • opponent_stats (bool) – If True, will retrieve opponent stats.

  • team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.

  • force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Raises:

ValueError – If stat_type is not valid.

Return type:

pd.DataFrame

read_player_season_stats(stat_type='standard')

Retrieve players from the datasource for the selected leagues and seasons.

The following stat types are available:
  • ‘standard’

  • ‘shooting’

  • ‘passing’

  • ‘passing_types’

  • ‘goal_shot_creation’

  • ‘defense’

  • ‘possession’

  • ‘playing_time’

  • ‘misc’

  • ‘keeper’

  • ‘keeper_adv’

Parameters:

stat_type (str) – Type of stats to retrieve.

Raises:

TypeError – If stat_type is not valid.

Return type:

pd.DataFrame

read_schedule(force_cache=False)

Retrieve the game schedule for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_player_match_stats(stat_type='summary', match_id=None, force_cache=False)

Retrieve the match stats for the selected leagues and seasons.

The following stat types are available:
  • ‘summary’

  • ‘keepers’

  • ‘passing’

  • ‘passing_types’

  • ‘defense’

  • ‘possession’

  • ‘misc’

Parameters:
  • stat_type (str) – Type of stats to retrieve.

  • match_id (int or list of int, optional) – Retrieve the match stats for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:
  • ValueError – If no games with the given IDs were found for the selected seasons and leagues.

  • TypeError – If stat_type is not valid.

Return type:

pd.DataFrame

read_lineup(match_id=None, force_cache=False)

Retrieve lineups for the selected leagues and seasons.

Parameters:
  • match_id (int or list of int, optional) – Retrieve the lineup for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:

ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame

read_events(match_id=None, force_cache=False)

Retrieve match events for the selected seasons or selected matches.

The data returned includes the timing of goals, cards and substitutions. Also includes the players who are involved in the event.

Parameters:
  • match_id (int or list of int, optional) – Retrieve the events for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:

ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame

read_shot_events(match_id=None, force_cache=False)

Retrieve shooting data for the selected seasons or selected matches.

The data returned includes who took the shot, when, with which body part and from how far away. It also includes the player who created the chance and the creating action that came before it.

Parameters:
  • match_id (int or list of int, optional) – Retrieve the shots for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:

ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

FiveThirtyEight

class soccerdata.FiveThirtyEight(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FiveThirtyEight'))

Provides pd.DataFrames from FiveThirtyEight’s “Club Soccer Predictions” project.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FiveThirtyEight.

Original project and background info: https://projects.fivethirtyeight.com/soccer-predictions/ and https://fivethirtyeight.com/features/how-our-club-soccer-projections-work/

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.
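The callable form of the proxy parameter can be sketched as follows. This is a minimal illustration, assuming a small pool of proxies you control; the helper name next_proxy, the pool and the second address are hypothetical and not part of soccerdata.

```python
from itertools import cycle
from typing import Dict

# Hypothetical proxy pool; replace the addresses with proxies you control.
PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]
_pool = cycle(PROXY_POOL)


def next_proxy() -> Dict[str, str]:
    """Return the next proxy mapping from the pool, wrapping around at the end."""
    return dict(next(_pool))


# Usage (requires soccerdata and network access):
# import soccerdata as sd
# five38 = sd.FiveThirtyEight('ENG-Premier League', '1819', proxy=next_proxy)
```

Because the callable is documented to be called after failed requests, each failure would move on to the next entry in the pool.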

read_games()

Retrieve all games for the selected leagues.

Return type:

pd.DataFrame

read_forecasts()

Retrieve the forecasted results for the selected leagues.

Return type:

pd.DataFrame

read_clinches()

Retrieve clinches for the selected leagues.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

FotMob

class soccerdata.FotMob(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FotMob'))

Provides pd.DataFrames from data available at http://www.fotmob.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FotMob.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

read_leagues()

Retrieve the selected leagues from the datasource.

Return type:

pd.DataFrame

read_seasons()

Retrieve the selected seasons for the selected leagues.

Return type:

pd.DataFrame

read_league_table(force_cache=False)

Retrieve the league table for the selected leagues.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_schedule(force_cache=False)

Retrieve the game schedule for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_team_match_stats(stat_type='Top stats', opponent_stats=True, team=None, force_cache=False)

Retrieve the match stats for the selected leagues and seasons.

The following stat types are available:
  • ‘Top stats’

  • ‘Shots’

  • ‘Expected goals (xG)’

  • ‘Passes’

  • ‘Defence’

  • ‘Duels’

  • ‘Discipline’

Parameters:
  • stat_type (str) – Type of stats to retrieve.

  • opponent_stats (bool) – If True, will retrieve opponent stats.

  • team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:
  • TypeError – If stat_type is not valid.

  • ValueError – If no games with the given IDs were found for the selected seasons and leagues.

Return type:

pd.DataFrame
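Since an invalid stat_type only surfaces as a TypeError once the method is called, it can help to validate the value up front. The sketch below uses the stat types listed above; the helper check_stat_type is hypothetical and not part of soccerdata.

```python
from typing import List

# Stat types listed in the documentation above.
STAT_TYPES: List[str] = [
    'Top stats', 'Shots', 'Expected goals (xG)', 'Passes',
    'Defence', 'Duels', 'Discipline',
]


def check_stat_type(stat_type: str) -> str:
    """Return stat_type unchanged, or raise TypeError if it is not listed."""
    if stat_type not in STAT_TYPES:
        raise TypeError(f'Invalid stat_type {stat_type!r}; choose one of {STAT_TYPES}')
    return stat_type


# Usage (requires soccerdata and network access):
# import soccerdata as sd
# fotmob = sd.FotMob('ENG-Premier League', '2223')
# shots = fotmob.read_team_match_stats(stat_type=check_stat_type('Shots'))
```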

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

MatchHistory

class soccerdata.MatchHistory(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/MatchHistory'))

Provides pd.DataFrames from CSV files available at http://www.football-data.co.uk/data.php.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/MatchHistory.

Parameters:
  • leagues (string or iterable) – IDs of leagues to include.

  • seasons (string, int or list) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path, optional) – Path to directory where data will be cached.

property seasons: List[str]

Return a list of selected seasons.

read_games()

Retrieve game history for the selected leagues and seasons.

Column names are explained here: http://www.football-data.co.uk/notes.txt

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase

property leagues: List[str]

Return a list of selected leagues.

Sofascore

class soccerdata.Sofascore(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/Sofascore'))

Provides pd.DataFrames from data available at http://www.sofascore.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/Sofascore.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

read_leagues()

Retrieve the selected leagues from the datasource.

Return type:

pd.DataFrame

read_seasons()

Retrieve the selected seasons for the selected leagues.

Return type:

pd.DataFrame

read_league_table(force_cache=False)

Retrieve the league table for the selected leagues.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_schedule(force_cache=False)

Retrieve the game schedule for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

SoFIFA

class soccerdata.SoFIFA(leagues=None, versions='latest', proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/SoFIFA'))

Provides pd.DataFrames from data at http://sofifa.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/SoFIFA.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • versions (string, int or list of int, optional) – FIFA releases to include. Should be specified by their ID used in the URL (e.g., 230034). Alternatively, the string “all” can be used to include all versions and “latest” to include the latest version only. Defaults to “latest”.

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

read_leagues()

Retrieve selected leagues from the datasource.

Return type:

pd.DataFrame

read_versions(max_age=1)

Retrieve available FIFA releases and rating updates.

Parameters:

max_age (int for age in days, or timedelta object) – The max. age of the locally cached release history before a new version is downloaded.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Return type:

pd.DataFrame

read_teams()

Retrieve all teams for the selected leagues.

Return type:

pd.DataFrame

read_players(team=None)

Retrieve all players for the selected leagues.

Parameters:

team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.

Raises:

ValueError – If no data is found for the given team(s) in the selected leagues.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

read_team_ratings()

Retrieve ratings for all teams in the selected leagues.

Return type:

pd.DataFrame

read_player_ratings(team=None, player=None)

Retrieve ratings for players.

Parameters:
  • team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.

  • player (int or list of int, optional) – Player(s) to retrieve. If None, will retrieve all players.

Return type:

pd.DataFrame

Understat

class soccerdata.Understat(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/Understat'))

Provides pd.DataFrames from data at https://understat.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/Understat.

Parameters:
  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

  • leagues (str | List[str] | None)

  • seasons (str | int | Iterable[str | int] | None)

property seasons: List[str]

Return a list of selected seasons.

read_leagues()

Retrieve the selected leagues from the datasource.

Return type:

pd.DataFrame

read_seasons()

Retrieve the selected seasons from the datasource.

Return type:

pd.DataFrame

read_schedule(include_matches_without_data=True, force_cache=False)

Retrieve the matches for the selected leagues and seasons.

Parameters:
  • include_matches_without_data (bool) – By default matches with and without data are returned. If False, will only return matches with data.

  • force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_team_match_stats(force_cache=False)

Retrieve the team match stats for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_player_season_stats(force_cache=False)

Retrieve the player season stats for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_player_match_stats(match_id=None)

Retrieve the player match stats for the selected leagues and seasons.

Parameters:

match_id (int or list of int, optional) – Retrieve the player match stats for a specific match.

Raises:

ValueError – If the given match_id could not be found in the selected seasons.

Return type:

pd.DataFrame

read_shot_events(match_id=None)

Retrieve the shot events for the selected matches or the selected leagues and seasons.

Parameters:

match_id (int or list of int, optional) – Retrieve the shot events for a specific match.

Raises:

ValueError – If the given match_id could not be found in the selected seasons.

Return type:

pd.DataFrame

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase

property leagues: List[str]

Return a list of selected leagues.

WhoScored

class soccerdata.WhoScored(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/WhoScored'), path_to_browser=None, headless=False)

Provides pd.DataFrames from data available at http://whoscored.com.

Data will be downloaded as necessary and cached locally in ~/soccerdata/data/WhoScored.

Parameters:
  • leagues (string or iterable, optional) – IDs of leagues to include.

  • seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]

  • proxy ('tor' or dict or list(dict) or callable, optional) –

    Use a proxy to hide your IP address. Valid options are:
    • 'tor': Uses the Tor network. Tor should be running in the background on port 9050.

    • dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:

      {
          'http': 'http://10.10.1.10:3128',
          'https': 'http://10.10.1.10:1080',
      }
      
    • list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.

    • callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.

  • no_cache (bool) – If True, will not use cached data.

  • no_store (bool) – If True, will not store downloaded data.

  • data_dir (Path) – Path to directory where data will be cached.

  • path_to_browser (Path, optional) – Path to the Chrome executable.

  • headless (bool, default: False) – If True, will run Chrome in headless mode. Setting this to False might help to avoid getting blocked. Only supported for Selenium <4.13.

read_schedule(force_cache=False)

Retrieve the game schedule for the selected leagues and seasons.

Parameters:

force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.

Return type:

pd.DataFrame

read_missing_players(match_id=None, force_cache=False)

Retrieve a list of injured and suspended players ahead of each game.

Parameters:
  • match_id (int or list of int, optional) – Retrieve the missing players for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

Raises:

ValueError – If the given match_id could not be found in the selected seasons.

Return type:

pd.DataFrame

read_events(match_id=None, force_cache=False, live=False, output_fmt='events')

Retrieve the event data for each game in the selected leagues and seasons.

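Parameters:
  • match_id (int or list of int, optional) – Retrieve the events for a specific game.

  • force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.

  • live (bool)

  • output_fmt (str, default: 'events') – The output format of the returned data. The formats 'spadl', 'atomic-spadl' and 'loader' require the socceraction package.
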
Raises:
  • ValueError – If the given match_id could not be found in the selected seasons.

  • ImportError – If the requested output format is ‘spadl’, ‘atomic-spadl’ or ‘loader’ but the socceraction package is not installed.

Return type:

See the description of the output_fmt parameter.
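Because the 'spadl', 'atomic-spadl' and 'loader' formats raise an ImportError when socceraction is missing, a script can check for the package first. A minimal sketch, assuming you want to fall back to the default 'events' format; the helper pick_output_fmt is hypothetical and not part of soccerdata.

```python
from importlib.util import find_spec

# Output formats that, according to the Raises section above, need socceraction.
NEEDS_SOCCERACTION = {'spadl', 'atomic-spadl', 'loader'}


def pick_output_fmt(preferred: str = 'spadl') -> str:
    """Return preferred, or fall back to 'events' if socceraction is not installed."""
    if preferred in NEEDS_SOCCERACTION and find_spec('socceraction') is None:
        return 'events'
    return preferred


# Usage (requires soccerdata, Chrome and network access):
# import soccerdata as sd
# ws = sd.WhoScored('ENG-Premier League', '2223')
# events = ws.read_events(output_fmt=pick_output_fmt('spadl'))
```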

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

If you would like to extend the functionality of soccerdata, you might also be interested in the following modules:

Base Readers

The logic for downloading data from the web is implemented in the base classes that are documented here. The base classes are not intended to be used directly but rather to be subclassed by the specific readers which implement the logic to parse the data.

The BaseRequestsReader is a wrapper around the requests library and is used by scrapers that do not require JavaScript to be executed. The BaseSeleniumReader is a wrapper around the selenium library and is used by scrapers that require JavaScript to be executed.

class soccerdata._common.BaseRequestsReader(leagues=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data'))

Base class for readers that use the Python requests module.

Parameters:
  • leagues (str | List[str] | None)

  • proxy (str | Dict[str, str] | List[Dict[str, str]] | Callable[[], Dict[str, str]] | None)

  • no_cache (bool)

  • no_store (bool)

  • data_dir (Path)

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase
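The cache check described for get can be sketched as follows. This is an illustration of the documented behaviour (cached file reused unless it is older than max_age, TypeError for other types), not the library's implementation; the helper is_cache_fresh is hypothetical.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional, Union


def is_cache_fresh(filepath: Path, max_age: Optional[Union[int, timedelta]] = None) -> bool:
    """Return True if filepath exists and is not older than max_age."""
    if not filepath.exists():
        return False
    if max_age is None:
        # Without an age limit, any existing file counts as a valid cache.
        return True
    if not isinstance(max_age, (int, timedelta)):
        raise TypeError('max_age must be an int (age in days) or a timedelta')
    # An int is interpreted as an age in days.
    limit = timedelta(days=max_age) if isinstance(max_age, int) else max_age
    age = datetime.now() - datetime.fromtimestamp(filepath.stat().st_mtime)
    return age <= limit
```

A subclass would download the URL only when such a check returns False.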

property leagues: List[str]

Return a list of selected leagues.

property seasons: List[str]

Return a list of selected seasons.

class soccerdata._common.BaseSeleniumReader(leagues=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data'), path_to_browser=None, headless=True)

Base class for readers that use Selenium.

Parameters:
  • leagues (str | List[str] | None)

  • proxy (str | Dict[str, str] | List[Dict[str, str]] | Callable[[], Dict[str, str]] | None)

  • no_cache (bool)

  • no_store (bool)

  • data_dir (Path)

  • path_to_browser (Path | None)

  • headless (bool)

classmethod available_leagues()

Return a list of league IDs available for this source.

Return type:

List[str]

get(url, filepath=None, max_age=None, no_cache=False, var=None)

Load data from url.

By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.

Parameters:
  • url (str) – URL to download.

  • filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.

  • max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.

  • no_cache (bool) – If True, will not use cached data. Overrides the class property.

  • var (str or list of str, optional) – Return a JavaScript variable instead of the page source.

Raises:

TypeError – If max_age is not an integer or timedelta object.

Returns:

File-like object of downloaded data.

Return type:

io.BufferedIOBase

property leagues: List[str]

Return a list of selected leagues.

property seasons: List[str]

Return a list of selected seasons.

Utilities

_common.season_code(season)

Convert a string or int to a season code like ‘1718’.

Parameters:

season (str | int)

Return type:

str
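To illustrate the kind of conversion involved, here is a standalone sketch that maps the season formats shown in the class parameters ('16-17', 2016, '2016-17') to a code like '1617'. It is not the library's implementation, and the function name to_season_code is hypothetical. Note that four-digit inputs such as '2021' are ambiguous; this sketch treats them as two-year codes when the two halves are consecutive.

```python
import re
from typing import Union


def to_season_code(season: Union[str, int]) -> str:
    """Convert inputs such as '16-17', 2016 or '2016-17' to a code like '1617'."""
    text = str(season)
    # Already a four-digit code made of two consecutive years, e.g. '1718'.
    if re.fullmatch(r'\d{4}', text) and (int(text[:2]) + 1) % 100 == int(text[2:]):
        return text
    # Otherwise take the first year and keep only its last two digits.
    match = re.match(r'(\d{4}|\d{1,2})', text)
    if match is None:
        raise ValueError(f'Unrecognised season: {season!r}')
    start = int(match.group(1)) % 100
    return f'{start:02d}{(start + 1) % 100:02d}'
```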

_common.make_game_id(row)

Return a game id based on date, home and away team.

Parameters:

row (Series)

Return type:

str

_common.standardize_colnames(df, cols=None)

Convert DataFrame column names to snake case.

Parameters:
  • df (DataFrame)

  • cols (List[str] | None)

Return type:

DataFrame
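As a rough illustration of a snake-case conversion, the sketch below works on plain column names. The exact rules used by soccerdata may differ; the function to_snake_case is hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a column name such as 'HomeTeam' or 'Away Team' to snake case."""
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name.strip())
    # Replace runs of whitespace and hyphens with a single underscore.
    name = re.sub(r'[\s\-]+', '_', name)
    # Collapse repeated underscores and lowercase the result.
    return re.sub(r'_+', '_', name).lower()
```

With pandas, such a function could be applied to every column via df.rename(columns=to_snake_case).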

_common.get_proxy()

Return a public proxy.

Return type:

Dict[str, str]

_common.check_proxy(proxy)

Check if proxy is working.

Parameters:

proxy (dict)

Return type:

bool

FAQ

Is web scraping legal?

Even though web scraping is ubiquitous, its legal status remains unclear. That is because whether web scraping is legal will depend on many aspects. It is always best to consult with a lawyer or legal expert to ensure that your web scraping activities are legal and comply with all applicable laws and regulations.

Something doesn’t work

  1. Have you updated to the newest version of soccerdata?

  2. Clear the cache or run your script without caching enabled.

  3. Does the log contain any warnings that might be related? The data you are looking for may be unavailable or may not be processed correctly.

  4. Open an issue on GitHub.
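For step 2, one way to clear the cache of a single data source is to delete its directory. A minimal sketch, assuming the default ~/soccerdata/data location mentioned in the class descriptions; the helper clear_cache is hypothetical and not part of soccerdata. Alternatively, pass no_cache=True when creating the scraper.

```python
import shutil
from pathlib import Path
from typing import Optional


def clear_cache(source: str, data_root: Optional[Path] = None) -> bool:
    """Delete the cache directory of one data source; return True if it existed."""
    if data_root is None:
        # Default location mentioned in the class descriptions.
        data_root = Path.home() / 'soccerdata' / 'data'
    cache_dir = data_root / source
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Usage: clear_cache('FotMob') removes ~/soccerdata/data/FotMob if it exists.
```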

Contributor Guide

This document lays out guidelines and advice for contributing to this project. If you’re thinking of contributing, please start by reading this document and getting a feel for how contributing to this project works. If you have any questions, feel free to reach out to Pieter Robberechts, the primary maintainer.

The guide is split into sections based on the type of contribution you’re thinking of making.

Bug Reports

Bug reports are hugely important! Before you raise one, though, please check through the GitHub issues, both open and closed, to confirm that the bug hasn’t been reported before.

When filing an issue, make sure to answer these questions:

  • Which Python version are you using?

  • Which version of soccerdata are you using?

  • What did you do?

  • What did you expect to see?

  • What did you see instead?

The best way to get your bug fixed is to provide a test case, and/or steps to reproduce the issue.
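The first two questions can be answered with a short script. This sketch assumes Python 3.8+ (for importlib.metadata); the helper environment_report is hypothetical.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def environment_report(package: str = 'soccerdata') -> str:
    """Return a one-line summary of the Python and package versions."""
    try:
        pkg_version = version(package)
    except PackageNotFoundError:
        pkg_version = 'not installed'
    return f'Python {platform.python_version()}, {package} {pkg_version}'


# Paste the output of print(environment_report()) into your issue.
```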

Feature Requests

If you believe there is a feature missing, feel free to raise a feature request on the Issue Tracker.

Documentation Contributions

Documentation improvements are always welcome! The documentation files live in the docs/ directory of the codebase. They’re written in reStructuredText, and use Sphinx to generate the full suite of documentation.

You do not have to set up a development environment to make small changes to the docs. Instead, you can edit files directly on GitHub and suggest changes.

When contributing documentation, please do your best to follow the style of the documentation files. This means a soft-limit of 79 characters wide in your text files and a semi-formal, yet friendly and approachable, prose style.

When presenting Python code, use single-quoted strings ('hello' instead of "hello").

Code Contributions

If you intend to contribute code, do not feel the need to sit on your contribution until it is perfectly polished and complete. It helps everyone involved for you to seek feedback as early as you possibly can. Submitting an early, unfinished version of your contribution for feedback can save you from putting a lot of work into a contribution that is not suitable for the project.

Setting up your development environment

You need Python 3.7.1+ and the following tools:

  • Poetry

  • Nox

Install the package with development requirements:

$ poetry install

You can now run an interactive Python session.

$ poetry run python

Steps for submitting Code

When contributing code, you’ll want to follow this checklist:

  1. Fork the repository on GitHub.

  2. Run the tests to confirm they all pass on your system. If they don’t, you’ll need to investigate why they fail. If you’re unable to diagnose this yourself, raise it as a bug report.

  3. Write tests that demonstrate your bug or feature. Ensure that they fail.

  4. Make your change.

  5. Run the entire test suite again, confirming that all tests pass including the ones you just added.

  6. Make sure your code follows the code style discussed below.

  7. Send a GitHub Pull Request to the main repository’s master branch. GitHub Pull Requests are the expected method of code collaboration on this project.

Testing the project

Run the full test suite:

$ nox

List the available Nox sessions:

$ nox --list-sessions

You can also run a specific Nox session. For example, invoke the unit test suite like this:

$ nox --session=tests

Unit tests are located in the tests directory, and are written using the pytest testing framework.

Code style

The soccerdata codebase uses the PEP 8 code style. In addition, we have a few guidelines:

  • Line-length can exceed 79 characters, to 100, when convenient.

  • Line-length can exceed 100 characters, when doing otherwise would be terribly inconvenient.

  • Always use single-quoted strings (e.g. '#soccer'), unless a single-quote occurs within the string.

To ensure all code conforms to this format, you can format the code using the pre-commit hooks.

$ nox --session=pre-commit

Docstrings are to follow the numpydoc guidelines.

Submitting changes

Open a pull request to submit changes to this project.

Your pull request needs to meet the following guidelines for acceptance:

  • The Nox test suite must pass without errors and warnings.

  • Include unit tests.

  • If your changes add functionality, update the documentation accordingly.

Feel free to submit early, though. We can always iterate on this.

To run linting and code formatting checks before committing your change, you can install pre-commit as a Git hook by running the following command:

$ nox --session=pre-commit -- install

It is recommended to open an issue before starting work on anything.

Apache License

Copyright (c) 2021 Pieter Robberechts

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This file incorporates code of the footballdata software package covered by the following copyright and permission notice:

Copyright (c) 2017 skagr

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.