Welcome to SoccerData’s docs!¶
Release v1.7.0. (pip install soccerdata)
SoccerData is a collection of scrapers to gather soccer data from popular websites, including Club Elo, ESPN, FBref, FiveThirtyEight, Football-Data.co.uk, FotMob, Sofascore, SoFIFA, Understat and WhoScored.
import soccerdata as sd
# Create a scraper class instance for the 2018/19 Premier League
five38 = sd.FiveThirtyEight('ENG-Premier League', '1819')
# Fetch data
games = five38.read_games()
forecasts = five38.read_forecasts()
clinches = five38.read_clinches()
Main features
Access current and historical soccer fixtures, forecasts, detailed match stats, event stream data and more.
All data is provided in the form of Pandas DataFrames with sensible, matching column names and identifiers across datasets to make working with the data and combining data from multiple sources easy.
Data is only downloaded when needed and cached locally to speed up your analysis scripts.
Integrates with the socceraction package to allow analysis of event stream data.
Do you like it? Let’s dive in!
Getting Started¶
New to soccerdata? Well, you came to the right place: this tutorial will walk you through installing, configuring, and using the library. By the end of this tutorial, you will be able to scrape data from the top-5 European leagues and use it to create your own data-driven analyses.
Installation¶
SoccerData can be easily installed via pip:
python3 -m pip install soccerdata
Scraping data¶
Each of the supported data sources has its corresponding class for fetching
data with a uniform API. For example, the FBref class is used to fetch data
from fbref.com.
import soccerdata as sd
# Create scraper class instance
fbref = sd.FBref()
Once you have a scraper class instance, you can use it to fetch data. See the examples and API reference for the full list of options available for each scraper. For example, to fetch aggregated shooting stats for all teams:
# Create dataframes
season_stats = fbref.read_team_season_stats(stat_type='shooting')
The data is always returned as a convenient Pandas DataFrame.
league | season | team | #Pl | 90s | Gls | Sh | SoT | SoT% | Sh/90 | SoT/90 | G/Sh | G/SoT | Dist | FK | PK | PKatt | xG | npxG | npxG/Sh | G-xG | np:G-xG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | Arsenal | 29 | 38.0 | 53 | 455 | 141 | 31.0 | 11.97 | 3.71 | 0.1 | 0.33 | 16.9 | 23 | 6 | 6 | 53.5 | 49.0 | 0.11 | -0.5 | -2.0 |
ENG-Premier League | 2021 | Aston Villa | 24 | 38.0 | 52 | 518 | 179 | 34.6 | 13.63 | 4.71 | 0.09 | 0.26 | 16.5 | 15 | 5 | 6 | 52.9 | 48.5 | 0.1 | -0.9 | -1.5 |
ENG-Premier League | 2021 | Brighton | 27 | 38.0 | 39 | 476 | 129 | 27.1 | 12.53 | 3.39 | 0.07 | 0.26 | 16.6 | 14 | 6 | 9 | 51.6 | 44.8 | 0.1 | -12.6 | -11.8 |
ENG-Premier League | 2021 | Burnley | 25 | 38.0 | 32 | 383 | 125 | 32.6 | 10.08 | 3.29 | 0.08 | 0.23 | 16.6 | 15 | 3 | 3 | 39.9 | 37.6 | 0.1 | -7.9 | -8.6 |
ENG-Premier League | 2021 | Chelsea | 27 | 38.0 | 56 | 553 | 194 | 35.1 | 14.55 | 5.11 | 0.09 | 0.25 | 16.3 | 16 | 8 | 10 | 64.0 | 56.4 | 0.1 | -8.0 | -8.4 |
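As a rough illustration of what can be done with such a DataFrame, the sketch below rebuilds a small part of the table by hand and derives the G-xG column from it. The frame is constructed manually so that the snippet runs without network access; in a real script it would come from `read_team_season_stats`. The index level names (`league`, `season`, `team`) follow the tables shown in these docs, and storing the season label as a string is an assumption.

```python
import pandas as pd

teams = ["Arsenal", "Aston Villa", "Brighton", "Burnley", "Chelsea"]

# Hand-copied subset of the table above (assumption: season label is a string).
index = pd.MultiIndex.from_tuples(
    [("ENG-Premier League", "2021", team) for team in teams],
    names=["league", "season", "team"],
)
season_stats = pd.DataFrame(
    {
        "Gls": [53, 52, 39, 32, 56],
        "Sh": [455, 518, 476, 383, 553],
        "xG": [53.5, 52.9, 51.6, 39.9, 64.0],
    },
    index=index,
)

# Goals minus expected goals, rounded to one decimal like the G-xG column.
season_stats["G-xG"] = (season_stats["Gls"] - season_stats["xG"]).round(1)

# Order teams from the largest underperformance to the smallest.
underperformers = season_stats.sort_values("G-xG")

# Select the rows of a single team from the multi-level index.
chelsea = season_stats.xs("Chelsea", level="team")
```

Because every scraper returns a regular DataFrame, the usual pandas tools for sorting, selecting and joining apply unchanged.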
By default, the data for all available leagues and the five most recent seasons will be retrieved. However, in most cases, you would want to limit the data to specific leagues and / or seasons. This can be done by passing a list of leagues and seasons to the constructor of the scraper class. For example:
# Create scraper class instance filtering on specific leagues and seasons
fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])
# Retrieve data for the specified leagues and seasons
season_stats = fbref.read_team_season_stats(stat_type='shooting')
Note that only a limited number of leagues are supported out-of-the-box. The
leagues available for each source can be listed with the available_leagues()
class method.
sd.FBref.available_leagues()
>>> ['ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga', 'ITA-Serie A']
You can add more leagues but there are no guarantees that they will be scraped correctly.
Data caching¶
Data caching is used to speed up the runtime and to prevent exceeding the rate
limit of web servers. By default, all downloaded data is cached to ~/soccerdata
on Linux and Mac OS, and to C:\Users\yourusername\soccerdata on Windows.
A custom location can be set if desired. You can configure this using
environment variables (see below) or on the level of an individual scraper by
setting the data_dir parameter when creating the scraper class instance:
# Create scraper class instance with custom caching directory
fbref = sd.FBref(data_dir="/tmp/FBref")
This directory can be deleted at any time to reclaim disk space. However, any data you need afterwards will have to be downloaded again, which slows down your scripts until the cache is rebuilt.
SoccerData has no knowledge of when the data on the server changes, so it is
up to the user to decide when to refresh the cache. This can be done by
deleting the cache directory or by setting the no_cache option to True when
creating the scraper class instance:
# Create scraper class instance which always re-downloads the latest data
fbref = sd.FBref(no_cache=True)
Some methods will assume the cache is always out-of-date (for example, when
scraping the fixtures of the current season). Typically, these methods will
have a force_cache option that can be set to True to force the cached data to
be used. For example:
fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['2324'])
fbref.read_schedule(force_cache=True)
Caching can also be disabled entirely by setting the no_store option to True
when creating the scraper class instance. However, it should almost always be
left enabled.
# Create scraper class instance with caching disabled
fbref = sd.FBref(no_store=True)
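The three options are easy to mix up, so here is a minimal sketch of how they interact according to the description above. It is not SoccerData's implementation: `CachePolicy`, `plan_request` and the `assume_stale` flag are hypothetical names, and the assumption that no_cache still wins over force_cache is mine.

```python
from dataclasses import dataclass


@dataclass
class CachePolicy:
    """Hypothetical container for the two constructor options."""
    no_cache: bool = False  # never read from the cache
    no_store: bool = False  # never write to the cache


def plan_request(policy, is_cached, assume_stale=False, force_cache=False):
    """Return (read_cache, download, store) for a single request.

    assume_stale models methods that treat the cache as out-of-date,
    and force_cache is the per-call override for those methods.
    """
    read_cache = (
        is_cached
        and not policy.no_cache
        and (force_cache or not assume_stale)
    )
    download = not read_cache
    store = download and not policy.no_store
    return read_cache, download, store


# Default behaviour: a cached file is reused and nothing is downloaded.
default_plan = plan_request(CachePolicy(), is_cached=True)
# no_cache: the file is downloaded again and the cache is overwritten.
refresh_plan = plan_request(CachePolicy(no_cache=True), is_cached=True)
# no_store: the file is downloaded but never written to disk.
ephemeral_plan = plan_request(CachePolicy(no_store=True), is_cached=False)
```

Each plan is a tuple of three booleans, which makes the difference between "do not read" (no_cache) and "do not write" (no_store) explicit.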
Global configuration¶
Several settings can be configured globally using the following environment variables:
SOCCERDATA_DIR
The directory where the downloaded data is cached and where logs are stored. By default, all data is stored to ~/soccerdata on Linux / Mac OS and C:\Users\yourusername\soccerdata on Windows.
SOCCERDATA_NOCACHE
If set to “true”, no cached data is returned. Note that no-cache does not mean “don’t cache”. All downloaded data is still cached and overwrites existing caches. If the sense of “don’t cache” that you want is actually “don’t store”, then SOCCERDATA_NOSTORE is the option to use. By default, data is retrieved from the cache.
SOCCERDATA_NOSTORE
If set to “true”, no data is stored. By default, data is cached.
SOCCERDATA_MAXAGE
The maximum age of cached data in seconds. If the cached data is older than this, it will be re-downloaded. By default, this is set to infinity.
SOCCERDATA_LOGLEVEL
The level of logging to use. By default, this is set to “INFO”.
Example:
# bash
export SOCCERDATA_DIR="$HOME/soccerdata"
export SOCCERDATA_NOCACHE="False"
export SOCCERDATA_NOSTORE="False"
export SOCCERDATA_LOGLEVEL="INFO"
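To make the SOCCERDATA_MAXAGE rule concrete, the following sketch checks whether a cached file is still fresh by comparing its modification time with a maximum age in seconds. The helper `is_cache_fresh` is a hypothetical name and the use of the file's modification time is an assumption; the library may decide freshness differently.

```python
import math
import os
import time


def is_cache_fresh(path, max_age=math.inf, now=None):
    """Return True if path exists and is at most max_age seconds old.

    The default of infinity mirrors the documented default: cached data
    never expires unless a maximum age is configured.
    """
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) <= max_age
```

A call such as `is_cache_fresh(path, max_age=24 * 3600)` would then treat anything older than one day as expired.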
Uniform team names¶
Each data source uses a different set of team names, which makes it difficult
to combine data from multiple sources. To mitigate this, SoccerData allows
translating the team names to uniform names. This is done by providing
a SOCCERDATA_DIR/config/team_dict.json file. This file should contain a
mapping between a generic name for each team and the team name used by each
data source that you want to support. The example below will map “Tottenham
Hotspur”, “Tottenham Hotspur FC” and “Spurs” to “Tottenham” in all scraped
data.
{
"Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]
}
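The effect of such a file can be sketched as follows: the mapping is inverted so that every source-specific name points to its generic name, and unknown names pass through unchanged. This is an illustration only; `build_replacements` and `standardize` are hypothetical helpers, not functions of the library, and the JSON is embedded as a string so the snippet runs without a config file.

```python
import json

# Same content as the team_dict.json example above.
team_dict_json = """
{
  "Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]
}
"""


def build_replacements(mapping):
    """Invert {generic: [aliases]} into {alias: generic}."""
    return {
        alias: generic
        for generic, aliases in mapping.items()
        for alias in aliases
    }


replacements = build_replacements(json.loads(team_dict_json))


def standardize(name):
    """Return the generic team name, or the input if no mapping exists."""
    return replacements.get(name, name)
```

With this mapping, `standardize("Spurs")` yields "Tottenham", while a name that is not listed, such as "Arsenal", is returned as is.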
Additional setup for scraping WhoScored data¶
WhoScored implements strong protection against scraping using Incapsula. To circumvent this, this scraper uses Selenium with the ChromeDriver extension to emulate a real user. Before using this scraper, you will have to install Chrome. A Selenium driver matching your Chrome version will be downloaded automatically when you run the scraper.
Next steps¶
Look at you! You’re now basically an expert at SoccerData! ✨
From this point you can:
Look at the example notebooks for each data source.
Take a deep dive into the API.
Give us feedback or contribute, see Contributing.
Have fun! 🎉
Overview of Data Sources¶
Currently, the following data sources are supported.
- ClubElo
URL | Example usage | API reference
from soccerdata import ClubElo
Teams’ relative strengths as Elo ratings, for most European leagues. Recalculated after every round, includes history.
- ESPN
URL | Example usage | API reference
from soccerdata import ESPN
Historical results, statistics and lineups.
- FBref
URL | Example usage | API reference
from soccerdata import FBref
Historical results, lineups, and detailed aggregated statistics for teams and individual players based on Stats Perform data.
- FiveThirtyEight
URL | Example usage | API reference
from soccerdata import FiveThirtyEight
Teams’ relative strengths as SPI ratings, predictions and results for the top European and American leagues.
- Football-Data.co.uk
URL | Example usage | API reference
from soccerdata import MatchHistory
Historical results, betting odds and match statistics. Level of detail depends on league.
- Sofascore
URL | Example usage | API reference
from soccerdata import Sofascore
Results, schedules, lineups, and detailed statistics for teams and individual players.
- SoFIFA
URL | Example usage | API reference
from soccerdata import SoFIFA
Detailed scores on all players’ abilities from EA Sports FC.
- Understat
URL | Example usage | API reference
from soccerdata import Understat
Advanced statistics such as xG, xGBuildup and xGChain, and shot events with associated xG values for the top European leagues.
- WhoScored
URL | Example usage | API reference
from soccerdata import WhoScored
Historical results, match preview data and detailed Opta event stream data for major leagues.
[2]:
import soccerdata as sd
ClubElo¶
[3]:
elo = sd.ClubElo()
print(elo.__doc__)
Provides pd.DataFrames from CSV API at http://api.clubelo.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/ClubElo``.
Since the source does not provide league names, this class will not filter
by league. League names will be inserted from the other sources where
available. Leagues that are only covered by clubelo.com will have NaN
values.
Parameters
----------
proxy : 'tor' or or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
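The proxy parameter described in the docstring above accepts a callable, which allows rotating proxies. A minimal sketch of such a callable is given below. The first address is the placeholder from the docstring, the second one is made up for illustration, and `next_proxy` is a hypothetical name; replace the addresses with proxies you actually control.

```python
from itertools import cycle

# Placeholder proxy mappings (protocol -> proxy address); not real servers.
_PROXIES = cycle(
    [
        {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
        {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
    ]
)


def next_proxy():
    """Return the next proxy mapping in round-robin order."""
    return next(_PROXIES)


# Usage (requires soccerdata and network access):
# elo = sd.ClubElo(proxy=next_proxy)
```

According to the docstring, the callable is invoked again after a failed request, so a round-robin function like this one would move on to the next proxy in the list.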
ELO scores for all teams at specified date¶
[4]:
current_elo = elo.read_by_date()
current_elo.head()
[4]:
team | rank | country | level | elo | from | to | league |
---|---|---|---|---|---|---|---|
Liverpool | 1.0 | ENG | 1 | 2047.083862 | 2022-04-20 | 2022-04-24 | ENG-Premier League |
Man City | 2.0 | ENG | 1 | 2037.059937 | 2022-04-21 | 2022-04-23 | ENG-Premier League |
Bayern | 3.0 | GER | 1 | 1984.775391 | 2022-04-18 | 2022-04-23 | GER-Bundesliga |
Real Madrid | 4.0 | ESP | 1 | 1969.584351 | 2022-04-21 | 2022-04-26 | ESP-La Liga |
Chelsea | 5.0 | ENG | 1 | 1921.101440 | 2022-04-21 | 2022-04-24 | ENG-Premier League |
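Elo ratings like these can be turned into a rough win expectancy with the standard Elo formula. The sketch below applies it to the two top-ranked teams from the table. It deliberately ignores home-field advantage and any other adjustment that ClubElo may apply, so treat the result as an approximation; `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(elo_a, elo_b):
    """Expected score of team A against team B under the standard Elo model.

    The value lies between 0 and 1 and counts a draw as half a win.
    """
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


# Ratings copied from the table above.
liverpool, man_city = 2047.083862, 2037.059937
liverpool_expectancy = elo_expected_score(liverpool, man_city)
```

With a rating gap of only about ten points, the expectancy for Liverpool ends up just above one half.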
Full ELO history for one club¶
[5]:
barca_elo = elo.read_team_history("Barcelona")
barca_elo.head()
[5]:
from | rank | team | country | level | elo | to |
---|---|---|---|---|---|---|
1939-10-22 | NaN | Barcelona | ESP | 1 | 1636.704590 | 1939-12-03 |
1939-12-04 | NaN | Barcelona | ESP | 1 | 1626.102173 | 1939-12-10 |
1939-12-11 | NaN | Barcelona | ESP | 1 | 1636.728271 | 1939-12-17 |
1939-12-18 | NaN | Barcelona | ESP | 1 | 1646.951660 | 1939-12-24 |
1939-12-25 | NaN | Barcelona | ESP | 1 | 1637.424316 | 1939-12-31 |
[2]:
import soccerdata as sd
ESPN¶
[3]:
espn = sd.ESPN(leagues="ENG-Premier League", seasons=2021)
print(espn.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg)
Provides pd.DataFrames from JSON api available at http://site.api.espn.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/ESPN``.
Parameters
----------
leagues : string or iterable, optional
IDs of leagues to include.
seasons : string, int or list, optional
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
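The seasons parameter accepts several formats, and the warning above shows that an id such as "2021" is read as the 20-21 season. The sketch below is one possible way to normalize these formats to a four-character code. It is an illustration of the idea under my own assumptions, not the parser used by the library, and `season_code` is a hypothetical name.

```python
import re


def season_code(season):
    """Normalize a season id to a four-character code such as '1617'."""
    s = str(season)
    # '2016-17' -> '1617'
    m = re.fullmatch(r"(\d{4})[-/](\d{2})", s)
    if m:
        return m.group(1)[2:] + m.group(2)
    # '16-17' -> '1617'
    m = re.fullmatch(r"(\d{2})[-/](\d{2})", s)
    if m:
        return m.group(1) + m.group(2)
    # Four digits: a two-year code ('1819', '2021') or a start year (2016).
    if re.fullmatch(r"\d{4}", s):
        first, second = int(s[:2]), int(s[2:])
        if second == (first + 1) % 100:
            # Ambiguous ids such as '2021' are read as the season 20-21.
            return s
        return s[2:] + f"{(second + 1) % 100:02d}"
    # One or two digits: start year within the century, e.g. 14 -> '1415'.
    if re.fullmatch(r"\d{1,2}", s):
        start = int(s)
        return f"{start:02d}{(start + 1) % 100:02d}"
    raise ValueError(f"Unrecognized season id: {season!r}")
```

Passing an unambiguous format such as '20-21' or '2020-21' avoids the warning shown above.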
Game schedule¶
[4]:
epl_schedule = espn.read_schedule()
epl_schedule.head()
[4]:
league | season | game | date | home_team | away_team | game_id | league_id |
---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2019-08-09 Liverpool-Norwich City | 2019-08-09 19:00:00+00:00 | Liverpool | Norwich City | 541844 | eng.1 |
ENG-Premier League | 2021 | 2019-08-10 AFC Bournemouth-Sheffield United | 2019-08-10 14:00:00+00:00 | AFC Bournemouth | Sheffield United | 541840 | eng.1 |
ENG-Premier League | 2021 | 2019-08-10 Burnley-Southampton | 2019-08-10 14:00:00+00:00 | Burnley | Southampton | 541841 | eng.1 |
ENG-Premier League | 2021 | 2019-08-10 Crystal Palace-Everton | 2019-08-10 14:00:00+00:00 | Crystal Palace | Everton | 541839 | eng.1 |
ENG-Premier League | 2021 | 2019-08-10 Tottenham Hotspur-Aston Villa | 2019-08-10 16:30:00+00:00 | Tottenham Hotspur | Aston Villa | 541837 | eng.1 |
Match sheet data¶
[5]:
matchsheet = espn.read_matchsheet(match_id=541465)
matchsheet.head()
[5]:
league | season | game | team | is_home | venue | attendance | capacity | roster | fouls_committed | yellow_cards | red_cards | offsides | won_corners | saves | possession_pct | total_shots | shots_on_target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2020-07-26 West Ham United-Aston Villa | Aston Villa | False | None | 0 | None | [{'active': True, 'starter': True, 'jersey': '... | 16 | 2 | 0 | 0 | 0 | 3 | 63 | 10 | 1 |
ENG-Premier League | 2021 | 2020-07-26 West Ham United-Aston Villa | West Ham United | True | None | 0 | None | [{'active': True, 'starter': True, 'jersey': '... | 13 | 1 | 0 | 1 | 7 | 0 | 37 | 13 | 4 |
Line ups¶
[6]:
lineups = espn.read_lineup(match_id=541465)
lineups.head()
[6]:
is_home | position | formation_place | sub_in | sub_out | appearances | fouls_committed | fouls_suffered | own_goals | red_cards | sub_ins | yellow_cards | goal_assists | shots_on_target | total_goals | total_shots | goals_conceded | saves | shots_faced | offsides | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | game | team | player | ||||||||||||||||||||
ENG-Premier League | 2021 | 2020-07-26 West Ham United-Aston Villa | Aston Villa | Anwar El Ghazi | False | Substitute | 0 | 90 | end | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 |
Conor Hourihane | False | Center Left Midfielder | 10 | start | 76 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | ||||
Douglas Luiz | False | Defensive Midfielder | 4 | start | end | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | ||||
Ezri Konsa | False | Center Right Defender | 5 | start | end | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | ||||
Frédéric Guilbert | False | Right Back | 2 | start | 76 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 |
[3]:
import soccerdata as sd
FBref¶
[4]:
fbref = sd.FBref(leagues="ENG-Premier League", seasons=2021)
print(fbref.__doc__)
Provides pd.DataFrames from data at http://fbref.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/FBref``.
Parameters
----------
leagues : string or iterable, optional
IDs of leagues to include. For efficiently reading data from the Top-5
European leagues, use "Big 5 European Leagues Combined".
seasons : string, int or list, optional
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:471: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg, stacklevel=1)
Team season stats¶
[5]:
team_season_stats = fbref.read_team_season_stats(stat_type="passing")
team_season_stats.head()
[5]:
players_used | 90s | Total | Short | Medium | Long | Ast | xAG | xA | A-xAG | KP | 1/3 | PPA | CrsPA | PrgP | url | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cmp | Att | Cmp% | TotDist | PrgDist | Cmp | Att | Cmp% | Cmp | Att | Cmp% | Cmp | Att | Cmp% | |||||||||||||||
league | season | team | ||||||||||||||||||||||||||
ENG-Premier League | 2021 | Arsenal | 29 | 38 | 18041 | 21807 | 82.7 | 306094 | 93540 | 8319 | 9236 | 90.1 | 7770 | 8814 | 88.2 | 1453 | 2621 | 55.4 | 38 | 35.1 | 35.3 | 2.9 | 332 | 1377 | 325 | 67 | 1760 | /en/squads/18bb7c10/2020-2021/Arsenal-Stats |
Aston Villa | 24 | 38 | 12938 | 16996 | 76.1 | 235927 | 93009 | 5729 | 6654 | 86.1 | 5222 | 6209 | 84.1 | 1571 | 3074 | 51.1 | 38 | 37.4 | 31.4 | 0.6 | 403 | 1086 | 323 | 76 | 1420 | /en/squads/8602292d/2020-2021/Aston-Villa-Stats | ||
Brighton | 27 | 38 | 15772 | 19871 | 79.4 | 283213 | 95248 | 6694 | 7654 | 87.5 | 6893 | 7960 | 86.6 | 1689 | 3090 | 54.7 | 24 | 33.2 | 34.9 | -9.2 | 358 | 1132 | 358 | 80 | 1516 | /en/squads/d07537b9/2020-2021/Brighton-and-Hov... | ||
Burnley | 25 | 38 | 10825 | 15562 | 69.6 | 196816 | 82192 | 5117 | 6058 | 84.5 | 3688 | 4824 | 76.5 | 1553 | 3535 | 43.9 | 20 | 27.7 | 26.9 | -7.7 | 267 | 923 | 292 | 80 | 1012 | /en/squads/943e8050/2020-2021/Burnley-Stats | ||
Chelsea | 27 | 38 | 21969 | 25799 | 85.2 | 360450 | 112392 | 10454 | 11454 | 91.3 | 8965 | 9976 | 89.9 | 1667 | 2703 | 61.7 | 38 | 41.5 | 37.1 | -3.5 | 448 | 1441 | 377 | 70 | 1798 | /en/squads/cff3d9bb/2020-2021/Chelsea-Stats |
Team match stats¶
[6]:
team_match_stats = fbref.read_team_match_stats(stat_type="schedule", team="Manchester City")
team_match_stats.head()
[6]:
date | time | round | day | venue | result | GF | GA | opponent | xG | xGA | Poss | Attendance | Captain | Formation | Referee | match_report | Notes | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | team | game | ||||||||||||||||||
ENG-Premier League | 2021 | Manchester City | 2020-09-21 Wolves-Manchester City | 2020-09-21 | 20:15:00 | Matchweek 2 | Mon | Away | W | 3 | 1 | Wolves | 1.9 | 0.6 | 65 | <NA> | Fernandinho | 4-2-3-1 | Andre Marriner | /en/matches/1c17eca3/Wolverhampton-Wanderers-M... | <NA> |
2020-09-27 Manchester City-Leicester City | 2020-09-27 | 16:30:00 | Matchweek 3 | Sun | Home | L | 2 | 5 | Leicester City | 0.9 | 2.9 | 72 | <NA> | Fernandinho | 4-2-3-1 | Michael Oliver | /en/matches/31c2a061/Manchester-City-Leicester... | <NA> | |||
2020-10-03 Leeds United-Manchester City | 2020-10-03 | 17:30:00 | Matchweek 4 | Sat | Away | D | 1 | 1 | Leeds United | 1.2 | 2.4 | 49 | <NA> | Kevin De Bruyne | 4-3-3 | Mike Dean | /en/matches/5ce15b58/Leeds-United-Manchester-C... | <NA> | |||
2020-10-17 Manchester City-Arsenal | 2020-10-17 | 17:30:00 | Matchweek 5 | Sat | Home | W | 1 | 0 | Arsenal | 1.3 | 0.9 | 58 | <NA> | Raheem Sterling | 3-1-4-2 | Chris Kavanagh | /en/matches/e95b8546/Manchester-City-Arsenal-O... | <NA> | |||
2020-10-24 West Ham-Manchester City | 2020-10-24 | 12:30:00 | Matchweek 6 | Sat | Away | D | 1 | 1 | West Ham | 1.0 | 0.3 | 69 | <NA> | Raheem Sterling | 4-3-3 | Anthony Taylor | /en/matches/2b0c0eca/West-Ham-United-Mancheste... | <NA> |
Player season stats¶
[7]:
player_season_stats = fbref.read_player_season_stats(stat_type="standard")
player_season_stats.head()
[7]:
nation | pos | age | born | Playing Time | Performance | Expected | Progression | Per 90 Minutes | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MP | Starts | Min | 90s | Gls | Ast | G+A | G-PK | PK | PKatt | CrdY | CrdR | xG | npxG | xAG | npxG+xAG | PrgC | PrgP | PrgR | Gls | Ast | G+A | G-PK | G+A-PK | xG | xAG | xG+xAG | npxG | npxG+xAG | ||||||||
league | season | team | player | |||||||||||||||||||||||||||||||||
ENG-Premier League | 2021 | Arsenal | Ainsley Maitland-Niles | ENG | MF,DF | 22 | 1997 | 11 | 5 | 490 | 5.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1 | 0.1 | 0.1 | 0.2 | 12 | 24 | 21 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.02 | 0.04 | 0.01 | 0.04 |
Alexandre Lacazette | FRA | FW | 29 | 1991 | 31 | 22 | 1923 | 21.4 | 13 | 2 | 15 | 10 | 3 | 3 | 3 | 0 | 11.7 | 9.3 | 2.2 | 11.5 | 15 | 55 | 112 | 0.61 | 0.09 | 0.7 | 0.47 | 0.56 | 0.55 | 0.1 | 0.65 | 0.43 | 0.54 | |||
Bernd Leno | GER | GK | 28 | 1992 | 35 | 35 | 3131 | 34.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 2 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | |||
Bukayo Saka | ENG | FW,MF | 18 | 2001 | 32 | 30 | 2553 | 28.4 | 5 | 3 | 8 | 5 | 0 | 0 | 1 | 0 | 6.9 | 6.9 | 4.4 | 11.3 | 85 | 101 | 269 | 0.18 | 0.11 | 0.28 | 0.18 | 0.28 | 0.24 | 0.16 | 0.4 | 0.24 | 0.4 | |||
Calum Chambers | ENG | DF | 25 | 1995 | 10 | 8 | 753 | 8.4 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0.4 | 0.4 | 1.1 | 1.5 | 16 | 40 | 49 | 0.0 | 0.24 | 0.24 | 0.0 | 0.24 | 0.05 | 0.13 | 0.18 | 0.05 | 0.18 |
Player match stats¶
[8]:
player_match_stats = fbref.read_player_match_stats(stat_type="passing", match_id='db261cb0')
player_match_stats.head()
[8]:
jersey_number | nation | pos | age | min | Total | Short | Medium | Long | Ast | xAG | xA | KP | 1/3 | PPA | CrsPA | PrgP | game_id | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Cmp | Att | Cmp% | TotDist | PrgDist | Cmp | Att | Cmp% | Cmp | Att | Cmp% | Cmp | Att | Cmp% | |||||||||||||||||||
league | season | game | team | player | ||||||||||||||||||||||||||||
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Crystal Palace | Andros Townsend | 10 | ENG | RM | 29-058 | 90 | 13 | 30 | 43.3 | 184 | 68 | 8 | 11 | 72.7 | 4 | 8 | 50.0 | 1 | 6 | 16.7 | 1 | 0.6 | 0.2 | 2 | 0 | 1 | 0 | 1 | db261cb0 |
Cheikhou Kouyaté | 8 | SEN | CB | 30-266 | 90 | 11 | 19 | 57.9 | 207 | 109 | 3 | 4 | 75.0 | 3 | 6 | 50.0 | 3 | 6 | 50.0 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 2 | db261cb0 | ||||
Eberechi Eze | 25 | ENG | LM | 22-075 | 10 | 8 | 10 | 80.0 | 177 | 46 | 2 | 2 | 100.0 | 3 | 3 | 100.0 | 2 | 3 | 66.7 | 0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 2 | db261cb0 | ||||
James McArthur | 18 | SCO | CM | 32-341 | 90 | 15 | 32 | 46.9 | 322 | 159 | 2 | 6 | 33.3 | 11 | 15 | 73.3 | 1 | 7 | 14.3 | 0 | 0.4 | 0.0 | 1 | 2 | 0 | 0 | 3 | db261cb0 | ||||
James McCarthy | 22 | IRL | CM | 29-305 | 73 | 14 | 23 | 60.9 | 281 | 75 | 4 | 6 | 66.7 | 6 | 10 | 60.0 | 3 | 5 | 60.0 | 0 | 0.0 | 0.1 | 0 | 0 | 0 | 0 | 2 | db261cb0 |
Game schedule¶
[9]:
epl_schedule = fbref.read_schedule()
epl_schedule.head()
[9]:
week | day | date | time | home_team | home_xg | score | away_xg | away_team | attendance | venue | referee | match_report | notes | game_id | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | game | |||||||||||||||
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 1 | Sat | 2020-09-12 | 15:00 | Crystal Palace | 1.1 | 1–0 | 0.9 | Southampton | <NA> | Selhurst Park | Jonathan Moss | /en/matches/db261cb0/Crystal-Palace-Southampto... | <NA> | db261cb0 |
2020-09-12 Fulham-Arsenal | 1 | Sat | 2020-09-12 | 12:30 | Fulham | 0.1 | 0–3 | 1.9 | Arsenal | <NA> | Craven Cottage | Chris Kavanagh | /en/matches/bf52349b/Fulham-Arsenal-September-... | <NA> | bf52349b | ||
2020-09-12 Liverpool-Leeds United | 1 | Sat | 2020-09-12 | 17:30 | Liverpool | 2.7 | 4–3 | 0.3 | Leeds United | <NA> | Anfield | Michael Oliver | /en/matches/21b58926/Liverpool-Leeds-United-Se... | <NA> | 21b58926 | ||
2020-09-12 West Ham-Newcastle Utd | 1 | Sat | 2020-09-12 | 20:00 | West Ham | 1.0 | 0–2 | 1.6 | Newcastle Utd | <NA> | London Stadium | Stuart Attwell | /en/matches/78495ced/West-Ham-United-Newcastle... | <NA> | 78495ced | ||
2020-09-13 Tottenham-Everton | 1 | Sun | 2020-09-13 | 16:30 | Tottenham | 1.1 | 0–1 | 1.2 | Everton | <NA> | Tottenham Hotspur Stadium | Martin Atkinson | /en/matches/fc7f9aa1/Tottenham-Hotspur-Everton... | <NA> | fc7f9aa1 |
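Note that the score column in this output is a string that separates the goals with an en dash (1–0), not a plain hyphen. A small sketch for splitting it into integers is shown below. `parse_score` is a hypothetical helper, and it only handles plain results; unplayed games or scores with extra annotations would need additional handling.

```python
import re


def parse_score(score):
    """Split a score such as '1–0' into (home_goals, away_goals).

    Accepts both the en dash used in the schedule and a plain hyphen.
    """
    m = re.fullmatch(r"\s*(\d+)\s*[–-]\s*(\d+)\s*", score)
    if m is None:
        raise ValueError(f"Unrecognized score: {score!r}")
    return int(m.group(1)), int(m.group(2))


# Liverpool-Leeds United from the table above.
home_goals, away_goals = parse_score("4–3")
```

Applied to a whole column, the same function could be used with the pandas `apply` method after dropping rows without a score.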
Line ups¶
[10]:
lineups = fbref.read_lineup(match_id='db261cb0')
lineups.head()
[10]:
league | season | game | jersey_number | player | team | is_starter | position | minutes_played |
---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 31 | Vicente Guaita | Crystal Palace | True | GK | 90 |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 2 | Joel Ward | Crystal Palace | True | RB | 90 |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 6 | Scott Dann | Crystal Palace | True | CB | 90 |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 8 | Cheikhou Kouyaté | Crystal Palace | True | CB | 90 |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 9 | Jordan Ayew | Crystal Palace | True | FW | 90 |
Events¶
[11]:
events = fbref.read_events(match_id='db261cb0')
events.head()
[11]:
league | season | game | team | minute | score | player1 | player2 | event_type |
---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Crystal Palace | 13 | 1:0 | Wilfried Zaha | Andros Townsend | goal |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Crystal Palace | 41 | 1:0 | Jeffrey Schlupp | None | yellow_card |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Southampton | 46 | 1:0 | Jannik Vestergaard | Jan Bednarek | substitute_in |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Southampton | 51 | 1:0 | Kyle Walker-Peters | None | yellow_card |
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | Crystal Palace | 58 | 1:0 | James McCarthy | None | yellow_card |
Shot events¶
[12]:
shots = fbref.read_shot_events(match_id='db261cb0')
shots.head()
[12]:
minute | player | team | xG | PSxG | outcome | distance | body_part | notes | SCA 1 | SCA 2 | |||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
player | event | player | event | ||||||||||||
league | season | game | |||||||||||||
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 4 | Jack Stephens | Southampton | 0.07 | 0.09 | Saved | 10 | Head | <NA> | James Ward-Prowse | Pass (Dead) | Kyle Walker-Peters | Pass (Dead) |
2020-09-12 Crystal Palace-Southampton | 13 | Wilfried Zaha | Crystal Palace | 0.44 | 0.73 | Goal | 9 | Right Foot | Volley | Andros Townsend | Pass (Live) | James McCarthy | Interception | ||
2020-09-12 Crystal Palace-Southampton | 32 | Danny Ings | Southampton | 0.04 | 0.12 | Saved | 10 | Head | <NA> | Kyle Walker-Peters | Pass (Live) | William Smallbone | Pass (Live) | ||
2020-09-12 Crystal Palace-Southampton | 36 | Cheikhou Kouyaté | Crystal Palace | 0.15 | 0.09 | Saved | 11 | Right Foot | <NA> | Andros Townsend | Pass (Dead) | Wilfried Zaha | Fouled | ||
2020-09-12 Crystal Palace-Southampton | 46 | Nathan Redmond | Southampton | 0.07 | 0.03 | Saved | 14 | Right Foot | <NA> | Che Adams | Pass (Live) | Danny Ings | Pass (Live) |
[3]:
import soccerdata as sd
FiveThirtyEight¶
[4]:
five38 = sd.FiveThirtyEight(leagues="ENG-Premier League", seasons=2021)
print(five38.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg)
Provides pd.DataFrames from fivethirtyeight's "Club Soccer Predictions" project.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/FiveThirtyEight``.
Original project and background info:
https://projects.fivethirtyeight.com/soccer-predictions/ and
https://fivethirtyeight.com/features/how-our-club-soccer-projections-work/
Parameters
----------
leagues : string or iterable, optional
IDs of Leagues to include.
seasons : string, int or list, optional
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
Game schedule and predicted results¶
[5]:
epl_schedule = five38.read_games()
epl_schedule.head()
[5]:
league | season | game | date | status | leg | home_team | away_team | home_id | away_id | home_code | away_code | prob_home | prob_away | prob_tie | round | matchday | score_home | score_away | adj_score_home | adj_score_away | chances_home | chances_away | moves_home | moves_away | aggregate_winner | shootout_winner |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 2020-09-12 14:00:00+00:00 | post | None | Crystal Palace | Southampton | 384 | 376 | CRY | SOU | 0.36979 | 0.35311 | 0.27710 | None | None | 1 | 0 | 1.05 | 0.000 | 1.507 | 1.160 | 0.943 | 1.652 | None | None |
ENG-Premier League | 2021 | 2020-09-12 Fulham-Arsenal | 2020-09-12 11:30:00+00:00 | post | None | Fulham | Arsenal | 370 | 359 | FUL | ARS | 0.28167 | 0.46081 | 0.25752 | None | None | 0 | 3 | 0.00 | 3.150 | 0.141 | 2.989 | 0.270 | 1.528 | None | None |
ENG-Premier League | 2021 | 2020-09-12 Liverpool-Leeds United | 2020-09-12 16:30:00+00:00 | post | None | Liverpool | Leeds United | 364 | 357 | LIV | LEE | 0.83582 | 0.04384 | 0.12034 | None | None | 4 | 3 | 4.20 | 3.150 | 3.328 | 0.368 | 3.006 | 0.747 | None | None |
ENG-Premier League | 2021 | 2020-09-12 West Ham United-Newcastle | 2020-09-12 19:00:00+00:00 | post | None | West Ham United | Newcastle | 371 | 361 | WHU | NEW | 0.48865 | 0.25556 | 0.25579 | None | None | 0 | 2 | 0.00 | 1.666 | 0.973 | 1.475 | 1.103 | 1.860 | None | None |
ENG-Premier League | 2021 | 2020-09-13 Tottenham Hotspur-Everton | 2020-09-13 15:30:00+00:00 | post | None | Tottenham Hotspur | Everton | 367 | 368 | TOT | EVE | 0.50235 | 0.24434 | 0.25331 | None | None | 0 | 1 | 0.00 | 1.050 | 1.029 | 1.109 | 1.316 | 1.110 | None | None |
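The prob_home, prob_away and prob_tie columns can be reduced to a single predicted outcome per match. The sketch below rebuilds three rows of the output above by hand so that it runs without network access; on the real data, the same two operations could presumably be applied to epl_schedule[["prob_home", "prob_away", "prob_tie"]].

```python
import pandas as pd

# Three rows copied from the output above; only the probability columns are kept.
probs = pd.DataFrame(
    {
        "prob_home": [0.36979, 0.28167, 0.83582],
        "prob_away": [0.35311, 0.46081, 0.04384],
        "prob_tie": [0.27710, 0.25752, 0.12034],
    },
    index=["Crystal Palace-Southampton", "Fulham-Arsenal", "Liverpool-Leeds United"],
)

# idxmax(axis=1) returns the label of the column holding the largest value in each row.
predicted = probs.idxmax(axis=1).str.replace("prob_", "", regex=False)

# Sanity check: the three outcome probabilities of a match should sum to about 1.
totals = probs.sum(axis=1)
```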
Forecasted league table¶
[6]:
forecasts = five38.read_forecasts()
forecasts.head()
[6]:
code | current_losses | current_points | current_ties | current_wins | d_rating | global_d | global_o | global_rating | goal_diff | goals_against | goals_scored | id | losses | make_champs | make_europa | make_playoffs | o_rating | points | position_1 | position_10 | position_11 | position_12 | position_13 | position_14 | position_15 | position_16 | position_17 | position_18 | position_19 | position_2 | position_20 | position_3 | position_4 | position_5 | position_6 | position_7 | position_8 | position_9 | promoted | relegated | sdr_id | ties | win_league | wins | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | last_updated | team | |||||||||||||||||||||||||||||||||||||||||||||
ENG-Premier League | 2021 | 2020-09-09T14:46:49.136Z | Arsenal | ARS | 0.0 | 0.0 | 0.0 | 0.0 | 1.33114 | 0.68339 | 2.13421 | 77.16360 | 4.13260 | 49.72715 | 53.85975 | 359 | 13.27675 | 0.16095 | 0.0 | 0.0 | 1.48647 | 55.50395 | 0.00720 | 0.06905 | 0.06170 | 0.05220 | 0.04835 | 0.04075 | 0.03595 | 0.03045 | 0.02650 | 0.02035 | 0.01590 | 0.02725 | 0.01165 | 0.05280 | 0.07370 | 0.08895 | 0.09120 | 0.08990 | 0.08340 | 0.07275 | 0.0 | 0.04790 | 73616 | 9.3329 | 0.00720 | 15.39035 |
Aston Villa | AVL | 0.0 | 0.0 | 0.0 | 0.0 | 1.52263 | 0.87488 | 1.77744 | 66.48339 | -17.57325 | 57.65595 | 40.08270 | 362 | 17.72125 | 0.02335 | 0.0 | 0.0 | 1.12969 | 41.66985 | 0.00030 | 0.05405 | 0.06005 | 0.06500 | 0.06855 | 0.07090 | 0.07480 | 0.07685 | 0.07955 | 0.08465 | 0.08315 | 0.00315 | 0.08465 | 0.00615 | 0.01375 | 0.02105 | 0.02830 | 0.03375 | 0.04145 | 0.04990 | 0.0 | 0.25245 | 73597 | 9.5832 | 0.00030 | 10.69555 | |||
Brighton and Hove Albion | BHA | 0.0 | 0.0 | 0.0 | 0.0 | 1.45713 | 0.80938 | 1.77377 | 68.02657 | -15.24810 | 55.36300 | 40.11490 | 331 | 17.18525 | 0.02950 | 0.0 | 0.0 | 1.12603 | 42.92225 | 0.00075 | 0.05475 | 0.06255 | 0.06575 | 0.07025 | 0.07305 | 0.07485 | 0.07690 | 0.07505 | 0.07790 | 0.07590 | 0.00295 | 0.06700 | 0.00965 | 0.01615 | 0.02415 | 0.03275 | 0.04185 | 0.04450 | 0.05330 | 0.0 | 0.22080 | 73651 | 9.7610 | 0.00075 | 11.05375 | |||
Burnley | BRN | 0.0 | 0.0 | 0.0 | 0.0 | 1.41912 | 0.77137 | 1.77961 | 69.08173 | -13.63310 | 53.84085 | 40.20775 | 379 | 16.80205 | 0.03210 | 0.0 | 0.0 | 1.13187 | 43.86545 | 0.00100 | 0.05910 | 0.06525 | 0.06555 | 0.07245 | 0.06670 | 0.07370 | 0.07325 | 0.07340 | 0.07010 | 0.06735 | 0.00325 | 0.06145 | 0.01065 | 0.01720 | 0.02915 | 0.03935 | 0.04365 | 0.05110 | 0.05635 | 0.0 | 0.19890 | 73618 | 9.8642 | 0.00100 | 11.33375 | |||
Chelsea | CHE | 0.0 | 0.0 | 0.0 | 0.0 | 1.16934 | 0.52160 | 2.51615 | 85.37204 | 26.57990 | 42.20305 | 68.78295 | 363 | 9.34340 | 0.53090 | 0.0 | 0.0 | 1.86841 | 69.17900 | 0.06525 | 0.02775 | 0.02295 | 0.01715 | 0.01315 | 0.01000 | 0.00735 | 0.00585 | 0.00400 | 0.00230 | 0.00215 | 0.13860 | 0.00050 | 0.17560 | 0.15145 | 0.11230 | 0.08610 | 0.06630 | 0.04965 | 0.04160 | 0.0 | 0.00495 | 73598 | 8.3954 | 0.06525 | 20.26120 |
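The position_N columns in the output above appear in lexicographic order (position_1, position_10, position_11, ...), which makes the distribution over final league positions hard to read. A minimal sketch of a numeric sort on the column names, using a hand-picked subset of the labels shown above:

```python
# A subset of the column labels from the output above, in their original order.
cols = ["position_1", "position_10", "position_11", "position_2", "position_20", "position_3"]

# Sort on the integer after the last underscore instead of on the raw string.
ordered = sorted(cols, key=lambda name: int(name.rsplit("_", 1)[1]))
```

The resulting list can then be used to select the columns in league-table order, for example with forecasts[ordered].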
Clinches¶
[7]:
clinches = five38.read_clinches()
clinches.head()
[7]:
league | season | date | typ | team |
---|---|---|---|---|
ENG-Premier League | 2021 | 2021-02-13 | relegated_elim | Manchester City |
ENG-Premier League | 2021 | 2021-02-17 | win_league_elim | Sheffield United |
ENG-Premier League | 2021 | 2021-02-17 | win_league_elim | West Bromwich Albion |
ENG-Premier League | 2021 | 2021-02-27 | win_league_elim | Fulham |
ENG-Premier League | 2021 | 2021-03-02 | win_league_elim | Newcastle |
[1]:
import pandas as pd
pd.set_option('display.max_columns', None)
[3]:
import soccerdata as sd
FotMob¶
[4]:
fotmob = sd.FotMob(leagues='ESP-La Liga', seasons='2022/2023')
print(fotmob.__doc__)
Provides pd.DataFrames from data available at http://www.fotmob.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/FotMob``.
Parameters
----------
leagues : string or iterable, optional
IDs of leagues to include.
seasons : string, int or list, optional
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- 'tor': Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
Read league table¶
[5]:
league_table = fotmob.read_league_table()
league_table.head()
[5]:
league | season | team | MP | W | D | L | GF | GA | GD | Pts |
---|---|---|---|---|---|---|---|---|---|---|
ESP-La Liga | 2223 | Barcelona | 38 | 28 | 4 | 6 | 70 | 20 | 50 | 88 |
ESP-La Liga | 2223 | Real Madrid | 38 | 24 | 6 | 8 | 75 | 36 | 39 | 78 |
ESP-La Liga | 2223 | Atletico Madrid | 38 | 23 | 8 | 7 | 70 | 33 | 37 | 77 |
ESP-La Liga | 2223 | Real Sociedad | 38 | 21 | 8 | 9 | 51 | 35 | 16 | 71 |
ESP-La Liga | 2223 | Villarreal | 38 | 19 | 7 | 12 | 59 | 40 | 19 | 64 |
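A scraped league table can be sanity-checked against the usual scoring rules. The sketch below copies the five rows shown above into a DataFrame and checks that points equal 3 × wins + draws, that goal difference equals goals for minus goals against, and that wins, draws and losses add up to matches played. The three-points-per-win rule is an assumption that holds for La Liga; the variable names are illustrative.

```python
import pandas as pd

# The five rows of the league table shown above.
table = pd.DataFrame(
    {
        "team": ["Barcelona", "Real Madrid", "Atletico Madrid", "Real Sociedad", "Villarreal"],
        "MP": [38, 38, 38, 38, 38],
        "W": [28, 24, 23, 21, 19],
        "D": [4, 6, 8, 8, 7],
        "L": [6, 8, 7, 9, 12],
        "GF": [70, 75, 70, 51, 59],
        "GA": [20, 36, 33, 35, 40],
        "GD": [50, 39, 37, 16, 19],
        "Pts": [88, 78, 77, 71, 64],
    }
).set_index("team")

# Each check yields one boolean per team; .all() collapses them to a single answer.
points_ok = (3 * table["W"] + table["D"] == table["Pts"]).all()
goal_diff_ok = (table["GF"] - table["GA"] == table["GD"]).all()
matches_ok = (table["W"] + table["D"] + table["L"] == table["MP"]).all()
```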
Read schedule¶
[6]:
schedule = fotmob.read_schedule()
schedule.head()
[6]:
league | season | game | round | week | date | home_team | away_team | home_score | away_score | status | game_id | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ESP-La Liga | 2223 | 2022-08-12 Osasuna-Sevilla | None | None | 2022-08-12 19:00:00+00:00 | Osasuna | Sevilla | 2 | 1 | FT | 3918013 | https://fotmob.com/matches/sevilla-vs-osasuna/... |
ESP-La Liga | 2223 | 2022-08-13 Barcelona-Rayo Vallecano | None | None | 2022-08-13 19:00:00+00:00 | Barcelona | Rayo Vallecano | 0 | 0 | FT | 3918009 | https://fotmob.com/matches/rayo-vallecano-vs-b... |
ESP-La Liga | 2223 | 2022-08-13 Celta Vigo-Espanyol | None | None | 2022-08-13 15:00:00+00:00 | Celta Vigo | Espanyol | 2 | 2 | FT | 3918011 | https://fotmob.com/matches/espanyol-vs-celta-v... |
ESP-La Liga | 2223 | 2022-08-13 Real Valladolid-Villarreal | None | None | 2022-08-13 17:00:00+00:00 | Real Valladolid | Villarreal | 0 | 3 | FT | 3918016 | https://fotmob.com/matches/villarreal-vs-real-... |
ESP-La Liga | 2223 | 2022-08-14 Almeria-Real Madrid | None | None | 2022-08-14 20:00:00+00:00 | Almeria | Real Madrid | 1 | 2 | FT | 3918014 | https://fotmob.com/matches/real-madrid-vs-alme... |
Read team match stats¶
[7]:
match_stats = fotmob.read_team_match_stats(opponent_stats=False, team='Valencia')
match_stats.head()
[7]:
league | season | game | team | Accurate passes | Ball possession | Big chances | Big chances missed | Corners | Expected goals (xG) | Fouls committed | Shots on target | Total shots | Accurate passes (%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ESP-La Liga | 2223 | 2022-08-14 Valencia-Girona | Valencia | 285 | 46 | 2 | 1 | 4 | 1.62 | 18 | 2 | 17 | 0.81 |
ESP-La Liga | 2223 | 2022-08-21 Athletic Club-Valencia | Valencia | 432 | 63 | 1 | 1 | 3 | 0.73 | 15 | 0 | 10 | 0.85 |
ESP-La Liga | 2223 | 2022-08-29 Valencia-Atletico Madrid | Valencia | 513 | 70 | 0 | 0 | 6 | 0.51 | 15 | 2 | 12 | 0.88 |
ESP-La Liga | 2223 | 2022-09-04 Valencia-Getafe | Valencia | 482 | 70 | 4 | 1 | 9 | 2.30 | 11 | 9 | 24 | 0.88 |
ESP-La Liga | 2223 | 2022-09-10 Rayo Vallecano-Valencia | Valencia | 266 | 54 | 3 | 2 | 7 | 1.80 | 11 | 4 | 17 | 0.77 |
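Derived metrics can be computed directly from these columns. As a small sketch, the share of shots that were on target is calculated below from two columns of the output above; the name shot_accuracy is introduced for this example only and is not a FotMob or soccerdata field.

```python
import pandas as pd

# Two columns copied from the five Valencia matches shown above.
stats = pd.DataFrame(
    {
        "Shots on target": [2, 0, 2, 9, 4],
        "Total shots": [17, 10, 12, 24, 17],
    },
    index=[
        "2022-08-14 Valencia-Girona",
        "2022-08-21 Athletic Club-Valencia",
        "2022-08-29 Valencia-Atletico Madrid",
        "2022-09-04 Valencia-Getafe",
        "2022-09-10 Rayo Vallecano-Valencia",
    ],
)

# Element-wise division of two columns gives one ratio per match.
shot_accuracy = stats["Shots on target"] / stats["Total shots"]
```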
[3]:
import soccerdata as sd
Match History¶
[4]:
mh = sd.MatchHistory(leagues="ENG-Premier League", seasons=2021)
print(mh.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:466: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg)
Provides pd.DataFrames from CSV files available at http://www.football-data.co.uk/data.php.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/MatchHistory``.
Parameters
----------
leagues : string or iterable
IDs of leagues to include.
seasons : string, int or list
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path, optional
Path to directory where data will be cached.
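The warning printed above shows that a bare 2021 is read as the 2020/21 season. One way to avoid relying on that interpretation is to pass a season string that names both years, in the '2016-17' style listed in the docstring. The helper below is a hypothetical convenience function, not part of soccerdata, that builds such a string from the starting year.

```python
def season_label(start_year: int) -> str:
    """Format a season by its starting year, e.g. 2020 -> '2020-21'."""
    end_year = (start_year + 1) % 100  # two-digit year in which the season ends
    return f"{start_year}-{end_year:02d}"


label = season_label(2020)

# Assumed usage (not executed here because it needs network access):
# import soccerdata as sd
# mh = sd.MatchHistory(leagues="ENG-Premier League", seasons=label)
```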
Historic match results and betting odds¶
[5]:
hist = mh.read_games()
hist.head()
[5]:
date | home_team | away_team | FTHG | FTAG | FTR | HTHG | HTAG | HTR | referee | HS | AS | HST | AST | HF | AF | HC | AC | HY | AY | HR | AR | B365H | B365D | B365A | BWH | BWD | BWA | IWH | IWD | IWA | PSH | PSD | PSA | WHH | WHD | WHA | VCH | VCD | VCA | MaxH | MaxD | MaxA | AvgH | AvgD | AvgA | B365>2.5 | B365<2.5 | P>2.5 | P<2.5 | Max>2.5 | Max<2.5 | Avg>2.5 | Avg<2.5 | AHh | B365AHH | B365AHA | PAHH | PAHA | MaxAHH | MaxAHA | AvgAHH | AvgAHA | B365CH | B365CD | B365CA | BWCH | BWCD | BWCA | IWCH | IWCD | IWCA | PSCH | PSCD | PSCA | WHCH | WHCD | WHCA | VCCH | VCCD | VCCA | MaxCH | MaxCD | MaxCA | AvgCH | AvgCD | AvgCA | B365C>2.5 | B365C<2.5 | PC>2.5 | PC<2.5 | MaxC>2.5 | MaxC<2.5 | AvgC>2.5 | AvgC<2.5 | AHCh | B365CAHH | B365CAHA | PCAHH | PCAHA | MaxCAHH | MaxCAHA | AvgCAHH | AvgCAHA | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | game | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ENG-Premier League | 2021 | 2020-01-11 Aston Villa-Southampton | 2020-01-11 12:00:00 | Aston Villa | Southampton | 3 | 4 | A | 0 | 3 | A | D England | 19 | 9 | 10 | 4 | 12 | 17 | 11 | 1 | 1 | 2 | 0 | 0 | 2.45 | 3.5 | 2.80 | 2.35 | 3.60 | 2.85 | 2.50 | 3.35 | 2.75 | 2.53 | 3.72 | 2.81 | 2.50 | 3.5 | 2.75 | 2.40 | 3.60 | 2.80 | 2.60 | 3.84 | 2.85 | 2.46 | 3.60 | 2.78 | 1.66 | 2.2 | 1.66 | 2.37 | 1.69 | 2.38 | 1.65 | 2.27 | 0.00 | 1.83 | 2.07 | 1.87 | 2.07 | 1.87 | 2.09 | 1.83 | 2.05 | 2.60 | 3.50 | 2.62 | 2.50 | 3.60 | 2.70 | 2.55 | 3.50 | 2.70 | 2.69 | 3.56 | 2.73 | 2.60 | 3.50 | 2.62 | 2.50 | 3.60 | 2.70 | 2.70 | 3.70 | 2.80 | 2.57 | 3.55 | 2.69 | 1.66 | 2.20 | 1.70 | 2.29 | 1.71 | 2.37 | 1.66 | 2.24 | 0.00 | 1.93 | 1.97 | 1.95 | 1.98 | 1.97 | 2.04 | 1.89 | 1.98 |
2020-01-11 Man United-Arsenal | 2020-01-11 16:30:00 | Man United | Arsenal | 0 | 1 | A | 0 | 0 | D | M Dean | 8 | 7 | 2 | 2 | 12 | 12 | 6 | 3 | 3 | 3 | 0 | 0 | 2.00 | 3.6 | 3.60 | 2.00 | 3.70 | 3.60 | 2.05 | 3.45 | 3.55 | 2.05 | 3.78 | 3.73 | 2.05 | 3.6 | 3.60 | 2.05 | 3.60 | 3.50 | 2.10 | 3.85 | 3.76 | 2.04 | 3.66 | 3.58 | 1.72 | 2.1 | 1.76 | 2.19 | 1.81 | 2.22 | 1.74 | 2.12 | -0.50 | 2.05 | 1.85 | 2.05 | 1.88 | 2.09 | 1.90 | 2.04 | 1.84 | 1.95 | 3.60 | 3.80 | 1.95 | 3.60 | 3.90 | 2.00 | 3.50 | 3.80 | 2.02 | 3.65 | 3.97 | 2.00 | 3.50 | 3.80 | 1.95 | 3.50 | 4.00 | 2.06 | 3.75 | 4.00 | 1.99 | 3.59 | 3.87 | 1.90 | 1.90 | 1.96 | 1.96 | 1.99 | 2.10 | 1.88 | 1.95 | -0.50 | 2.00 | 1.90 | 2.03 | 1.90 | 2.05 | 1.92 | 2.00 | 1.88 | ||
2020-01-11 Newcastle-Everton | 2020-01-11 14:00:00 | Newcastle | Everton | 2 | 1 | H | 0 | 0 | D | S Attwell | 11 | 15 | 4 | 4 | 9 | 10 | 5 | 4 | 2 | 4 | 0 | 0 | 3.70 | 3.4 | 2.05 | 3.75 | 3.40 | 2.05 | 3.55 | 3.30 | 2.10 | 3.82 | 3.56 | 2.09 | 3.70 | 3.4 | 2.05 | 3.50 | 3.50 | 2.05 | 4.00 | 3.59 | 2.15 | 3.71 | 3.47 | 2.06 | 1.80 | 2.0 | 1.86 | 2.07 | 1.88 | 2.08 | 1.82 | 2.01 | 0.25 | 2.05 | 1.75 | 2.15 | 1.80 | 2.17 | 1.83 | 2.12 | 1.77 | 3.30 | 3.30 | 2.25 | 3.30 | 3.30 | 2.25 | 3.15 | 3.35 | 2.30 | 3.34 | 3.43 | 2.33 | 3.40 | 3.25 | 2.25 | 3.30 | 3.30 | 2.25 | 3.62 | 3.50 | 2.34 | 3.34 | 3.33 | 2.27 | 2.00 | 1.80 | 2.12 | 1.81 | 2.14 | 1.92 | 2.04 | 1.80 | 0.25 | 1.95 | 1.95 | 1.93 | 2.00 | 1.99 | 2.03 | 1.93 | 1.95 | ||
2020-01-11 Tottenham-Brighton | 2020-01-11 19:15:00 | Tottenham | Brighton | 2 | 1 | H | 1 | 0 | H | G Scott | 9 | 6 | 3 | 2 | 14 | 13 | 4 | 5 | 2 | 1 | 0 | 0 | 1.61 | 4.0 | 5.50 | 1.62 | 4.25 | 5.00 | 1.65 | 3.90 | 5.00 | 1.65 | 4.26 | 5.54 | 1.63 | 4.0 | 5.25 | 1.62 | 4.20 | 5.25 | 1.67 | 4.30 | 5.75 | 1.63 | 4.17 | 5.29 | 1.66 | 2.2 | 1.68 | 2.32 | 1.71 | 2.35 | 1.65 | 2.26 | -1.00 | 2.02 | 1.77 | 2.15 | 1.80 | 2.16 | 1.84 | 2.09 | 1.80 | 1.55 | 4.33 | 6.00 | 1.57 | 4.25 | 5.50 | 1.60 | 4.20 | 5.75 | 1.61 | 4.36 | 5.77 | 1.57 | 4.00 | 6.00 | 1.57 | 4.20 | 5.75 | 1.63 | 4.45 | 6.39 | 1.59 | 4.29 | 5.62 | 1.66 | 2.20 | 1.68 | 2.32 | 1.72 | 2.37 | 1.65 | 2.27 | -1.00 | 2.06 | 1.84 | 2.11 | 1.84 | 2.13 | 1.99 | 2.02 | 1.86 | ||
2020-02-11 Fulham-West Brom | 2020-02-11 17:30:00 | Fulham | West Brom | 2 | 0 | H | 2 | 0 | H | S Hooper | 13 | 10 | 6 | 1 | 13 | 11 | 5 | 2 | 3 | 2 | 0 | 0 | 2.40 | 3.3 | 3.00 | 2.40 | 3.30 | 3.00 | 2.45 | 3.10 | 3.00 | 2.47 | 3.40 | 3.13 | 2.40 | 3.3 | 3.00 | 2.45 | 3.25 | 3.00 | 2.56 | 3.50 | 3.17 | 2.44 | 3.32 | 3.01 | 2.00 | 1.8 | 2.13 | 1.80 | 2.14 | 1.85 | 2.06 | 1.78 | -0.25 | 2.10 | 1.80 | 2.11 | 1.84 | 2.15 | 1.86 | 2.10 | 1.80 | 2.40 | 3.30 | 3.00 | 2.45 | 3.30 | 2.95 | 2.40 | 3.05 | 3.10 | 2.48 | 3.32 | 3.19 | 2.40 | 3.20 | 3.10 | 2.40 | 3.25 | 3.10 | 2.48 | 3.45 | 3.22 | 2.41 | 3.27 | 3.09 | 2.05 | 1.85 | 2.05 | 1.88 | 2.11 | 1.92 | 2.02 | 1.81 | -0.25 | 2.06 | 1.84 | 2.11 | 1.84 | 2.12 | 1.87 | 2.08 | 1.82 |
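The B365H, B365D and B365A columns hold Bet365 decimal odds for a home win, a draw and an away win. A common way to turn such odds into probabilities is to take their reciprocals and rescale them so that they sum to one, which removes the bookmaker's margin proportionally. The sketch below applies this to the first row of the output above; the function name is illustrative and proportional rescaling is only one of several possible normalisations.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)  # exceeds 1 because of the bookmaker's margin
    return tuple(p / overround for p in raw)


# B365H, B365D, B365A of Aston Villa-Southampton in the output above.
home, draw, away = implied_probabilities(2.45, 3.5, 2.80)
```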
[3]:
import soccerdata as sd
SoFIFA¶
[4]:
sofifa = sd.SoFIFA(leagues="ENG-Premier League", versions="latest")
print(sofifa.__doc__)
Provides pd.DataFrames from data at http://sofifa.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/SoFIFA``.
Parameters
----------
leagues : string or iterable, optional
IDs of leagues to include.
versions : string, int or list of int, optional
FIFA releases to include. Should be specified by their ID used in the URL
(e.g., 230034). Alternatively, the string "all" can be used to include all
versions and "latest" to include the latest version only. Defaults to
"latest".
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
EA Sports FIFA team ratings¶
[5]:
team_ratings = sofifa.read_team_ratings()
team_ratings.head()
[5]:
league | team | overall | attack | midfield | defence | transfer_budget | players | fifa_edition | update |
---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | AFC Bournemouth | 74 | 75 | 73 | 73 | €0 | 29 | FIFA 23 | May 26, 2023 |
ENG-Premier League | Arsenal | 82 | 82 | 84 | 81 | €0 | 32 | FIFA 23 | May 26, 2023 |
ENG-Premier League | Aston Villa | 79 | 79 | 77 | 79 | €0 | 32 | FIFA 23 | May 26, 2023 |
ENG-Premier League | Brentford | 75 | 74 | 75 | 76 | €0 | 31 | FIFA 23 | May 26, 2023 |
ENG-Premier League | Brighton & Hove Albion | 77 | 75 | 77 | 78 | €0 | 28 | FIFA 23 | May 26, 2023 |
EA Sports FIFA player ratings¶
[6]:
player_ratings = sofifa.read_player_ratings(team="Arsenal")
player_ratings.head()
[6]:
player | fifa_edition | update | overallrating | potential | crossing | finishing | headingaccuracy | shortpassing | volleys | dribbling | curve | fk_accuracy | longpassing | ballcontrol | acceleration | sprintspeed | agility | reactions | balance | shotpower | jumping | stamina | strength | longshots | aggression | interceptions | positioning | vision | penalties | composure | defensiveawareness | standingtackle | slidingtackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aaron Ramsdale | FIFA 23 | May 26, 2023 | 83 | 86 | 15 | 14 | 15 | 56 | 16 | 18 | 17 | 16 | 57 | 36 | 47 | 50 | 53 | 82 | 49 | 65 | 66 | 35 | 59 | 15 | 35 | 24 | 18 | 65 | 25 | 65 | 20 | 16 | 15 | 84 | 79 | 87 | 79 | 86 |
Ainsley Maitland-Niles | FIFA 23 | May 26, 2023 | 73 | 75 | 67 | 58 | 54 | 74 | 44 | 71 | 64 | 50 | 66 | 73 | 79 | 77 | 79 | 70 | 69 | 67 | 66 | 69 | 67 | 62 | 65 | 70 | 69 | 67 | 69 | 70 | 66 | 74 | 72 | 15 | 9 | 7 | 10 | 13 |
Albert-Mboyo Sambi Lokonga | FIFA 23 | May 26, 2023 | 75 | 82 | 68 | 53 | 61 | 79 | 58 | 78 | 71 | 68 | 77 | 81 | 63 | 69 | 78 | 72 | 68 | 68 | 71 | 77 | 63 | 63 | 72 | 77 | 65 | 75 | 41 | 80 | 64 | 73 | 68 | 14 | 7 | 6 | 13 | 9 |
Amario Cozier-Duberry | FIFA 23 | May 26, 2023 | 59 | 77 | 58 | 57 | 38 | 55 | 60 | 60 | 53 | 41 | 44 | 58 | 72 | 69 | 73 | 50 | 68 | 56 | 44 | 55 | 45 | 45 | 55 | 27 | 59 | 56 | 55 | 55 | 28 | 28 | 24 | 11 | 12 | 9 | 13 | 7 |
Arthur Okonkwo | FIFA 23 | May 26, 2023 | 66 | 75 | 10 | 6 | 12 | 33 | 7 | 9 | 12 | 20 | 20 | 17 | 44 | 44 | 40 | 63 | 31 | 47 | 50 | 29 | 50 | 7 | 20 | 11 | 6 | 35 | 19 | 30 | 13 | 13 | 12 | 64 | 65 | 63 | 67 | 66 |
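Because the ratings are plain numeric columns, new measures can be derived with ordinary column arithmetic. The sketch below ranks the five players shown above by the gap between potential and overallrating; the column name headroom is introduced for this example and does not exist in the SoFIFA data.

```python
import pandas as pd

# Two rating columns copied from the five rows shown above.
ratings = pd.DataFrame(
    {
        "overallrating": [83, 73, 75, 59, 66],
        "potential": [86, 75, 82, 77, 75],
    },
    index=pd.Index(
        [
            "Aaron Ramsdale",
            "Ainsley Maitland-Niles",
            "Albert-Mboyo Sambi Lokonga",
            "Amario Cozier-Duberry",
            "Arthur Okonkwo",
        ],
        name="player",
    ),
)

# "headroom" is a hypothetical column: how far a player is rated below his potential.
ratings["headroom"] = ratings["potential"] - ratings["overallrating"]
by_headroom = ratings.sort_values("headroom", ascending=False)
```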
[7]:
sofifa.read_teams()
[7]:
team_id | team | league | fifa_edition | update |
---|---|---|---|---|
10 | Manchester City | ENG-Premier League | FIFA 23 | May 26, 2023 |
9 | Liverpool | ENG-Premier League | FIFA 23 | May 26, 2023 |
1 | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
5 | Chelsea | ENG-Premier League | FIFA 23 | May 26, 2023 |
11 | Manchester United | ENG-Premier League | FIFA 23 | May 26, 2023 |
18 | Tottenham Hotspur | ENG-Premier League | FIFA 23 | May 26, 2023 |
13 | Newcastle United | ENG-Premier League | FIFA 23 | May 26, 2023 |
2 | Aston Villa | ENG-Premier League | FIFA 23 | May 26, 2023 |
19 | West Ham United | ENG-Premier League | FIFA 23 | May 26, 2023 |
95 | Leicester City | ENG-Premier League | FIFA 23 | May 26, 2023 |
110 | Wolverhampton Wanderers | ENG-Premier League | FIFA 23 | May 26, 2023 |
14 | Nottingham Forest | ENG-Premier League | FIFA 23 | May 26, 2023 |
1808 | Brighton & Hove Albion | ENG-Premier League | FIFA 23 | May 26, 2023 |
7 | Everton | ENG-Premier League | FIFA 23 | May 26, 2023 |
1799 | Crystal Palace | ENG-Premier League | FIFA 23 | May 26, 2023 |
144 | Fulham | ENG-Premier League | FIFA 23 | May 26, 2023 |
8 | Leeds United | ENG-Premier League | FIFA 23 | May 26, 2023 |
17 | Southampton | ENG-Premier League | FIFA 23 | May 26, 2023 |
1925 | Brentford | ENG-Premier League | FIFA 23 | May 26, 2023 |
1943 | AFC Bournemouth | ENG-Premier League | FIFA 23 | May 26, 2023 |
[8]:
sofifa.read_players(team="Arsenal")
[8]:
player_id | player | team | league | fifa_edition | update |
---|---|---|---|---|---|
233934 | Aaron Ramsdale | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
231936 | Benjamin White | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
243715 | William Saliba | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
232580 | Gabriel dos S. Magalhães | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
227813 | Oleksandr Zinchenko | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
205498 | Luiz Frello Filho Jorge | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
222665 | Martin Ødegaard | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
199503 | Granit Xhaka | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
246669 | Bukayo Saka | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
230666 | Gabriel Fernando de Jesus | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
251566 | Gabriel Teodoro Martinelli Silva | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
207421 | Leandro Trossard | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
236988 | Eddie Nketiah | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
209989 | Thomas Partey | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
256958 | Fábio Daniel Ferreira Vieira | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
226491 | Kieran Tierney | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
265693 | Jakub Kiwior | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
233267 | Matt Turner | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
228295 | Rob Holding | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
240273 | Emile Smith Rowe | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
231448 | Reiss Nelson | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
211454 | Mohamed Elneny | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
232938 | Takehiro Tomiyasu | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
247731 | Matthew Smith | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
248732 | Karl Hein | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
259403 | Joel Ideho | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
260141 | Cătălin Cîrjan | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
259095 | George Lewis | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
271743 | Amario Cozier-Duberry | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
271808 | Lino Sousa | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
259097 | Salah-Eddine Oulad M'Hand | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
274387 | Reuell Walters | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
270390 | Marcus Vincius Oliveira Alencar | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
260653 | Brooke Norton-Cuffy | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
260144 | Charlie Patino | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
251530 | Nuno Albertino Varela Tavares | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
247463 | Folarin Balogun | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
246671 | Arthur Okonkwo | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
241928 | Albert Sambi Lokonga | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
235755 | Auston Trusty | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
226110 | Nicolas Pépé | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
225782 | Ainsley Maitland-Niles | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
222562 | Rúnar Alex Rúnarsson | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
206654 | Pablo Marí Villar | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
201118 | Cédric Ricardo Alves Soares | Arsenal | ENG-Premier League | FIFA 23 | May 26, 2023 |
[2]:
import soccerdata as sd
Understat¶
[3]:
understat = sd.Understat(leagues="ENG-Premier League", seasons="2015/2016")
print(understat.__doc__)
Provides pd.DataFrames from data at https://understat.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/Understat``.
Parameters
----------
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
Leagues¶
[4]:
leagues = understat.read_leagues()
leagues.head()
[4]:
league | league_id | url |
---|---|---|
ENG-Premier League | 1 | https://understat.com/league/EPL |
Seasons¶
[5]:
seasons = understat.read_seasons()
seasons.head()
[5]:
league | season | league_id | season_id | url |
---|---|---|---|---|
ENG-Premier League | 1516 | 1 | 2015 | https://understat.com/league/EPL/2015 |
Schedule¶
[6]:
schedule = understat.read_schedule()
schedule.head()
[6]:
league | season | game | league_id | season_id | game_id | date | home_team_id | away_team_id | home_team | away_team | away_team_code | home_team_code | home_goals | away_goals | home_xg | away_xg | is_result | has_data | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 73 | 71 | Bournemouth | Aston Villa | AVL | BOU | 0 | 1 | 0.876106 | 0.782253 | True | True | https://understat.com/match/82 |
ENG-Premier League | 1516 | 2015-08-08 Chelsea-Swansea | 1 | 2015 | 86 | 2015-08-08 20:30:00 | 80 | 84 | Chelsea | Swansea | SWA | CHE | 2 | 2 | 0.64396 | 2.59203 | True | True | https://understat.com/match/86 |
ENG-Premier League | 1516 | 2015-08-08 Everton-Watford | 1 | 2015 | 83 | 2015-08-08 18:00:00 | 72 | 90 | Everton | Watford | WAT | EVE | 2 | 2 | 0.604226 | 0.557892 | True | True | https://understat.com/match/83 |
ENG-Premier League | 1516 | 2015-08-08 Leicester-Sunderland | 1 | 2015 | 84 | 2015-08-08 18:00:00 | 75 | 77 | Leicester | Sunderland | SUN | LEI | 4 | 2 | 2.56803 | 1.45946 | True | True | https://understat.com/match/84 |
ENG-Premier League | 1516 | 2015-08-08 Manchester United-Tottenham | 1 | 2015 | 81 | 2015-08-08 15:45:00 | 89 | 82 | Manchester United | Tottenham | TOT | MUN | 1 | 0 | 0.627539 | 0.6746 | True | True | https://understat.com/match/81 |
Team match stats¶
[7]:
team_match_stats = understat.read_team_match_stats()
team_match_stats.head()
[7]:
league | season | game | league_id | season_id | game_id | date | home_team_id | away_team_id | home_team | away_team | away_team_code | home_team_code | ... | away_ppda | away_deep_completions | home_points | home_expected_points | home_goals | home_xg | home_np_xg | home_np_xg_difference | home_ppda | home_deep_completions |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 73 | 71 | Bournemouth | Aston Villa | AVL | BOU | ... | 11.846154 | 2 | 0 | 1.3912 | 0 | 0.876106 | 0.876106 | 0.093853 | 6.9 | 11 |
ENG-Premier League | 1516 | 2015-08-08 Chelsea-Swansea | 1 | 2015 | 86 | 2015-08-08 20:30:00 | 80 | 84 | Chelsea | Swansea | SWA | CHE | ... | 8.833333 | 5 | 1 | 0.1836 | 2 | 0.64396 | 0.64396 | -1.1869 | 10.363636 | 10 |
ENG-Premier League | 1516 | 2015-08-08 Everton-Watford | 1 | 2015 | 83 | 2015-08-08 18:00:00 | 72 | 90 | Everton | Watford | WAT | EVE | ... | 17.157895 | 4 | 1 | 1.3301 | 2 | 0.604226 | 0.604226 | 0.046334 | 6.65 | 5 |
ENG-Premier League | 1516 | 2015-08-08 Leicester-Sunderland | 1 | 2015 | 84 | 2015-08-08 18:00:00 | 75 | 77 | Leicester | Sunderland | SUN | LEI | ... | 9.555556 | 6 | 3 | 2.1323 | 4 | 2.56803 | 1.80686 | 0.3474 | 10.88 | 5 |
ENG-Premier League | 1516 | 2015-08-08 Manchester United-Tottenham | 1 | 2015 | 81 | 2015-08-08 15:45:00 | 89 | 82 | Manchester United | Tottenham | TOT | MUN | ... | 8.21875 | 10 | 3 | 1.2482 | 1 | 0.627539 | 0.627539 | -0.047061 | 13.826087 | 4 |
5 rows × 26 columns
Player season stats¶
[8]:
player_season_stats = understat.read_player_season_stats()
player_season_stats.head()
[8]:
league | season | player | league_id | season_id | team | team_id | player_id | position | matches | minutes | goals | xg | np_goals | np_xg | assists | xa | shots | key_passes | yellow_cards | red_cards | xg_chain | xg_buildup |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 1516 | Aaron Cresswell | 1 | 2015 | West Ham | 81 | 534 | D M | 37 | 3319 | 2 | 1.092563 | 2 | 1.092563 | 4 | 3.294618 | 28 | 40 | 1 | 0 | 10.339367 | 6.831139 |
ENG-Premier League | 1516 | Aaron Lennon | 1 | 2015 | Everton | 72 | 593 | F M S | 25 | 1534 | 5 | 2.226095 | 5 | 2.226095 | 0 | 1.697951 | 18 | 23 | 1 | 0 | 5.535392 | 1.90374 |
ENG-Premier League | 1516 | Aaron Ramsey | 1 | 2015 | Arsenal | 83 | 504 | M S | 31 | 2624 | 5 | 8.614323 | 5 | 8.614323 | 4 | 4.046242 | 69 | 37 | 4 | 0 | 23.282566 | 14.176444 |
ENG-Premier League | 1516 | Abdul Rahman Baba | 1 | 2015 | Chelsea | 80 | 684 | D S | 15 | 1018 | 0 | 0.379115 | 0 | 0.379115 | 1 | 0.179241 | 6 | 3 | 1 | 0 | 3.869669 | 3.490554 |
ENG-Premier League | 1516 | Adam Bogdan | 1 | 2015 | Liverpool | 87 | 698 | GK | 2 | 180 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.120216 | 0.120216 |
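Season totals are easier to compare once they are scaled by playing time. The sketch below computes expected goals per 90 minutes from the minutes and xg columns of the five players shown above; xg_per90 is a name introduced for this example and is not an Understat column.

```python
import pandas as pd

# minutes and xg copied from the five rows shown above.
stats = pd.DataFrame(
    {
        "minutes": [3319, 1534, 2624, 1018, 180],
        "xg": [1.092563, 2.226095, 8.614323, 0.379115, 0.0],
    },
    index=pd.Index(
        ["Aaron Cresswell", "Aaron Lennon", "Aaron Ramsey", "Abdul Rahman Baba", "Adam Bogdan"],
        name="player",
    ),
)

# Scale each season total to a 90-minute rate (hypothetical derived column).
stats["xg_per90"] = stats["xg"] / stats["minutes"] * 90
```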
Player match stats¶
[9]:
player_match_stats = understat.read_player_match_stats()
player_match_stats.head()
[9]:
league | season | game | team | player | league_id | season_id | game_id | team_id | player_id | position | position_id | minutes | goals | own_goals | shots | xg | xa | xg_chain | xg_buildup |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Ashley Westwood | 1 | 2015 | 82 | 71 | 669 | MC | 9 | 90 | 0 | 0 | 0 | 0.0 | 0.374082 | 0.131937 | 0.131937 |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Brad Guzan | 1 | 2015 | 82 | 71 | 662 | GK | 1 | 90 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Carlos Sánchez | 1 | 2015 | 82 | 71 | 667 | Sub | 17 | 18 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Ciaran Clark | 1 | 2015 | 82 | 71 | 875 | DC | 3 | 90 | 0 | 0 | 0 | 0.0 | 0.0 | 0.131937 | 0.131937 |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Gabriel Agbonlahor | 1 | 2015 | 82 | 71 | 890 | FW | 15 | 90 | 0 | 0 | 2 | 0.13016 | 0.113668 | 0.243828 | 0.0 |
Shot events¶
[10]:
shot_events = understat.read_shot_events()
shot_events.head()
[10]:
league | season | game | team | player | league_id | season_id | game_id | date | shot_id | team_id | player_id | assist_player_id | assist_player | xg | location_x | location_y | minute | body_part | situation | result |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Gabriel Agbonlahor | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 487146 | 71 | 890 | <NA> | <NA> | 0.075484 | 0.86 | 0.503 | 67 | Left Foot | Open Play | Blocked Shot |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Gabriel Agbonlahor | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 487150 | 71 | 890 | 548651 | Micah Richards | 0.054676 | 0.795 | 0.507 | 81 | Left Foot | Open Play | Saved Shot |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Idrissa Gueye | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 487136 | 71 | 668 | 548653 | Jordan Veretout | 0.018269 | 0.742 | 0.598 | 26 | Right Foot | Open Play | Missed Shot |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Idrissa Gueye | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 487144 | 71 | 668 | 548657 | Gabriel Agbonlahor | 0.113668 | 0.883 | 0.561 | 47 | Left Foot | Open Play | Saved Shot |
ENG-Premier League | 1516 | 2015-08-08 Bournemouth-Aston Villa | Aston Villa | Leandro Bacuna | 1 | 2015 | 82 | 2015-08-08 18:00:00 | 487141 | 71 | 674 | <NA> | <NA> | 0.097371 | 0.786 | 0.437 | 42 | Right Foot | Direct Freekick | Blocked Shot |
[3]:
import soccerdata as sd
WhoScored¶
[4]:
ws = sd.WhoScored(leagues="ENG-Premier League", seasons=2021)
print(ws.__doc__)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:462: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg)
Provides pd.DataFrames from data available at http://whoscored.com.
Data will be downloaded as necessary and cached locally in
``~/soccerdata/data/WhoScored``.
Parameters
----------
leagues : string or iterable, optional
IDs of Leagues to include.
seasons : string, int or list, optional
Seasons to include. Supports multiple formats.
Examples: '16-17'; 2016; '2016-17'; [14, 15, 16]
proxy : 'tor' or dict or list(dict) or callable, optional
Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in
the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be
a mapping of supported protocols to proxy addresses. For example::
{
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
- list(dict): A list of proxies to choose from. A different proxy will
be selected from this list after failed requests, allowing rotating
proxies.
- callable: A function that returns a valid proxy. This function will
be called after failed requests, allowing rotating proxies.
no_cache : bool
If True, will not use cached data.
no_store : bool
If True, will not store downloaded data.
data_dir : Path
Path to directory where data will be cached.
path_to_browser : Path, optional
Path to the Chrome executable.
headless : bool, default: True
If True, will run Chrome in headless mode. Setting this to False might
help to avoid getting blocked.
Game schedule¶
[5]:
epl_schedule = ws.read_schedule()
epl_schedule.head()
[5]:
league | season | game | game_id | home_team | away_team | date | url | stage |
---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2020-09-12 Crystal Palace-Southampton | 1485186 | Crystal Palace | Southampton | 2020-09-12 15:00:00 | https://www.whoscored.com/Matches/1485186/Live... | NaN |
ENG-Premier League | 2021 | 2020-09-12 Fulham-Arsenal | 1485187 | Fulham | Arsenal | 2020-09-12 12:30:00 | https://www.whoscored.com/Matches/1485187/Live... | NaN |
ENG-Premier League | 2021 | 2020-09-12 Liverpool-Leeds United | 1485188 | Liverpool | Leeds United | 2020-09-12 17:30:00 | https://www.whoscored.com/Matches/1485188/Live... | NaN |
ENG-Premier League | 2021 | 2020-09-12 West Ham United-Newcastle United | 1485191 | West Ham United | Newcastle United | 2020-09-12 20:00:00 | https://www.whoscored.com/Matches/1485191/Live... | NaN |
ENG-Premier League | 2021 | 2020-09-13 Tottenham-Everton | 1485189 | Tottenham | Everton | 2020-09-13 16:30:00 | https://www.whoscored.com/Matches/1485189/Live... | NaN |
Injured and suspended players¶
[6]:
missing_players = ws.read_missing_players(match_id=1485184)
missing_players.head()
[6]:
league | season | game | team | player | game_id | player_id | reason | status |
---|---|---|---|---|---|---|---|---|
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | Burnley | Charlie Taylor | 1485184 | 107462 | injured doubtful | Doubtful |
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | Burnley | Dwight McNeil | 1485184 | 357427 | injured doubtful | Doubtful |
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | Burnley | Jay Rodriguez | 1485184 | 33891 | injured doubtful | Doubtful |
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | Burnley | Jimmy Dunne | 1485184 | 366743 | injured doubtful | Doubtful |
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | Manchester United | Eric Bailly | 1485184 | 243814 | injured doubtful | Doubtful |
Match event stream data¶
[7]:
events = ws.read_events(match_id=1485184)
events.head()
[7]:
period | minute | expanded_minute | type | outcome_type | team | player | qualifiers | x | y | end_x | end_y | goal_mouth_y | goal_mouth_z | is_touch | is_shot | is_goal | related_event_id | related_player_id | blocked_x | blocked_y | card_type | game_id | team_id | player_id | ||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | game | id | |||||||||||||||||||||||||
ENG-Premier League | 2021 | 2021-01-12 Burnley-Manchester United | 2253458317 | PreMatch | 0 | 0 | FormationSet | Successful | Burnley | NaN | [{'type': {'displayName': 'TeamPlayerFormation... | 0.0 | 0.0 | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1485184 | 184 | NaN |
2253458375 | PreMatch | 0 | 0 | FormationSet | Successful | Man Utd | NaN | [{'type': {'displayName': 'CaptainPlayerId', '... | 0.0 | 0.0 | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1485184 | 32 | NaN | |||
2253487469 | FirstHalf | 0 | 0 | Start | Successful | Burnley | NaN | [] | 0.0 | 0.0 | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1485184 | 184 | NaN | |||
2253487473 | FirstHalf | 0 | 0 | Start | Successful | Man Utd | NaN | [] | 0.0 | 0.0 | NaN | NaN | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1485184 | 32 | NaN | |||
2253487625 | FirstHalf | 0 | 0 | Pass | Successful | Burnley | Ashley Westwood | [{'type': {'displayName': 'Angle', 'value': 21... | 50.3 | 50.3 | 30.5 | 50.3 | NaN | NaN | True | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1485184 | 184 | 79050.0 |
Match event stream data can be returned in various formats, which can be selected with the output_fmt parameter:
- events (default): Returns a dataframe with all events.
- raw: Returns the original unformatted WhoScored JSON.
- spadl: Returns a dataframe with the SPADL representation of the original events.
- atomic-spadl: Returns a dataframe with the Atomic-SPADL representation of the original events.
- loader: Returns a socceraction.data.opta.OptaLoader instance.
[12]:
events = ws.read_events(match_id=1485184, output_fmt="raw")
import json
print(json.dumps(events[1485184][0], indent=2))
{
"eventId": 2,
"expandedMinute": 0,
"id": 2253487473,
"isTouch": false,
"minute": 0,
"outcomeType": {
"displayName": "Successful",
"value": 1
},
"period": {
"displayName": "FirstHalf",
"value": 1
},
"qualifiers": [],
"satisfiedEventsTypes": [],
"second": 0,
"teamId": 32,
"type": {
"displayName": "Start",
"value": 32
},
"x": 0,
"y": 0
}
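The raw format keeps WhoScored's nested objects, where fields such as type, period and outcomeType carry both a displayName and a numeric value. The sketch below shows one way to flatten such an event into plain key/value pairs. The helper name flatten_event is hypothetical and not part of soccerdata, and the input is the event printed above with the empty satisfiedEventsTypes list left out.

```python
import json

# The raw event printed above, abbreviated to the fields used here.
raw_event = json.loads("""
{
  "eventId": 2,
  "expandedMinute": 0,
  "id": 2253487473,
  "isTouch": false,
  "minute": 0,
  "outcomeType": {"displayName": "Successful", "value": 1},
  "period": {"displayName": "FirstHalf", "value": 1},
  "qualifiers": [],
  "second": 0,
  "teamId": 32,
  "type": {"displayName": "Start", "value": 32},
  "x": 0,
  "y": 0
}
""")


def flatten_event(event):
    """Replace nested {"displayName", "value"} objects by their display name.

    Hypothetical helper for illustration; not part of soccerdata.
    """
    flat = {}
    for key, value in event.items():
        if isinstance(value, dict) and "displayName" in value:
            flat[key] = value["displayName"]
        else:
            flat[key] = value
    return flat


flat_event = flatten_event(raw_event)
print(flat_event["type"], flat_event["period"], flat_event["outcomeType"])
```

All other fields, such as id and teamId, pass through unchanged.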
[13]:
actions = ws.read_events(match_id=1485184, output_fmt='spadl')
actions.head()
[13]:
game_id | original_event_id | period_id | time_seconds | team_id | player_id | start_x | end_x | start_y | end_y | type_id | result_id | bodypart_id | action_id | player | team | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1485184 | 2253487625 | 1 | 0.0 | 184 | 79050.0 | 52.815 | 32.025 | 34.204 | 34.204 | 0 | 1 | 0 | 0 | Ashley Westwood | Burnley |
1 | 1485184 | 2253487639 | 1 | 2.0 | 184 | 131464.0 | 31.080 | 38.220 | 36.312 | 15.844 | 0 | 1 | 0 | 1 | James Tarkowski | Burnley |
2 | 1485184 | NaN | 1 | 4.5 | 184 | 80067.0 | 38.220 | 43.365 | 15.844 | 12.512 | 21 | 1 | 0 | 2 | Matthew Lowton | Burnley |
3 | 1485184 | 2253487685 | 1 | 7.0 | 184 | 80067.0 | 43.365 | 90.300 | 12.512 | 49.708 | 0 | 1 | 0 | 3 | Matthew Lowton | Burnley |
4 | 1485184 | 2253487689 | 1 | 11.0 | 184 | 93473.0 | 90.300 | 105.000 | 49.708 | 38.828 | 11 | 0 | 0 | 4 | Robbie Brady | Burnley |
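The coordinates in this table differ from those in the events dataframe because SPADL expresses positions in metres. The sketch below assumes that the source coordinates are percentages of the pitch (0 to 100 on both axes) and that the SPADL pitch measures 105 by 68 metres. The helper to_spadl_coords is hypothetical, and it ignores any direction-of-play adjustment that a real converter may apply.

```python
# Assumption: source coordinates are percentages of the pitch (0-100 on both
# axes) and the SPADL pitch measures 105 x 68 metres.
PITCH_LENGTH = 105.0
PITCH_WIDTH = 68.0


def to_spadl_coords(x_pct, y_pct):
    """Rescale percentage coordinates to metres (hypothetical helper)."""
    return x_pct / 100 * PITCH_LENGTH, y_pct / 100 * PITCH_WIDTH


# First pass of the match: (50.3, 50.3) -> (30.5, 50.3) in the events table.
start_x, start_y = to_spadl_coords(50.3, 50.3)
end_x, end_y = to_spadl_coords(30.5, 50.3)
print(round(start_x, 3), round(start_y, 3), round(end_x, 3), round(end_y, 3))
```

Up to floating-point rounding, these values agree with start_x, start_y, end_x and end_y in the first row of the SPADL table.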
[14]:
atomic_actions = ws.read_events(match_id=1485184, output_fmt='atomic-spadl')
atomic_actions.head()
[14]:
game_id | original_event_id | action_id | period_id | time_seconds | team_id | player_id | x | y | dx | dy | type_id | bodypart_id | player | team | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1485184 | 2253487625 | 0 | 1 | 0.00 | 184 | 79050.0 | 52.815 | 34.204 | -20.790 | 0.000 | 0 | 0 | Ashley Westwood | Burnley |
1 | 1485184 | 2253487625 | 1 | 1 | 1.00 | 184 | 131464.0 | 32.025 | 34.204 | 0.000 | 0.000 | 23 | 0 | James Tarkowski | Burnley |
2 | 1485184 | 2253487639 | 2 | 1 | 2.00 | 184 | 131464.0 | 31.080 | 36.312 | 7.140 | -20.468 | 0 | 0 | James Tarkowski | Burnley |
3 | 1485184 | 2253487639 | 3 | 1 | 3.25 | 184 | 80067.0 | 38.220 | 15.844 | 0.000 | 0.000 | 23 | 0 | Matthew Lowton | Burnley |
4 | 1485184 | NaN | 4 | 1 | 4.50 | 184 | 80067.0 | 38.220 | 15.844 | 5.145 | -3.332 | 21 | 0 | Matthew Lowton | Burnley |
[15]:
# Scrape all games and return a socceraction.data.opta.OptaLoader
loader = ws.read_events(output_fmt='loader')
# Now use this loader to load the data
print("Games:")
df_games = loader.games(competition_id="ENG-Premier League", season_id="2021")
display(df_games.head())
print("Teams:")
df_teams = loader.teams(game_id=1485184)
display(df_teams.head())
print("Players:")
df_players = loader.players(game_id=1485184)
display(df_players.head())
print("Events:")
df_events = loader.events(game_id=1485184)
display(df_events.head())
# You can use the socceraction package to convert the events
# to SPADL and to compute xT or VAEP action values
Games:
game_id | season_id | competition_id | game_day | game_date | home_team_id | away_team_id | home_score | away_score | duration | referee | venue | attendance | home_manager | away_manager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1485494 | 2021 | ENG-Premier League | None | 2021-04-04 12:00:00 | 18 | 184 | 3 | 2 | 98 | Andre Marriner | St. Mary's Stadium | 0 | Ralph Hasenhüttl | Sean Dyche |
1 | 1485300 | 2021 | ENG-Premier League | None | 2020-12-16 20:00:00 | 170 | 211 | 0 | 0 | 95 | Robert Jones | Craven Cottage | 0 | Scott Parker | Graham Potter |
2 | 1485264 | 2021 | ENG-Premier League | None | 2020-12-06 19:15:00 | 26 | 161 | 4 | 0 | 97 | Craig Pawson | Anfield | 2000 | Jürgen Klopp | Nuno Espírito Santo |
3 | 1485519 | 2021 | ENG-Premier League | None | 2021-05-16 16:30:00 | 175 | 26 | 1 | 2 | 102 | Mike Dean | The Hawthorns | 0 | Sam Allardyce | Jürgen Klopp |
4 | 1485436 | 2021 | ENG-Premier League | None | 2021-03-19 20:00:00 | 170 | 19 | 1 | 2 | 100 | David Coote | Craven Cottage | 0 | Scott Parker | Marcelo Bielsa |
Teams:
team_id | team_name | |
---|---|---|
0 | 184 | Burnley |
1 | 32 | Man Utd |
Players:
game_id | team_id | player_id | player_name | is_starter | minutes_played | jersey_number | starting_position | |
---|---|---|---|---|---|---|---|---|
0 | 1485184 | 184 | 105720 | Nick Pope | True | 102 | 1 | GK |
1 | 1485184 | 184 | 80067 | Matthew Lowton | True | 102 | 2 | DR |
2 | 1485184 | 184 | 94935 | Ben Mee | True | 102 | 6 | DC |
3 | 1485184 | 184 | 131464 | James Tarkowski | True | 102 | 5 | DC |
4 | 1485184 | 184 | 24148 | Erik Pieters | True | 102 | 23 | DL |
Events:
game_id | event_id | period_id | team_id | player_id | type_id | timestamp | minute | second | outcome | start_x | start_y | end_x | end_y | qualifiers | related_player_id | touch | shot | goal | type_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1485184 | 2253487473 | 1 | 32 | NaN | 32 | 2021-01-12 20:15:00 | 0 | 0 | True | 0.0 | 0.0 | 0.0 | 0.0 | {} | NaN | False | False | False | start |
1 | 1485184 | 2253487469 | 1 | 184 | NaN | 32 | 2021-01-12 20:15:00 | 0 | 0 | True | 0.0 | 0.0 | 0.0 | 0.0 | {} | NaN | False | False | False | start |
2 | 1485184 | 2253487625 | 1 | 184 | 79050.0 | 1 | 2021-01-12 20:15:00 | 0 | 0 | True | 50.3 | 50.3 | 30.5 | 50.3 | {213: '3.1', 178: True, 141: '50.3', 212: '20.... | NaN | True | False | False | pass |
3 | 1485184 | 2253487639 | 1 | 184 | 131464.0 | 1 | 2021-01-12 20:15:02 | 0 | 2 | True | 29.6 | 53.4 | 36.4 | 23.3 | {178: True, 213: '5.0', 212: '21.7', 141: '23.... | NaN | True | False | False | pass |
4 | 1485184 | 2253487685 | 1 | 184 | 80067.0 | 1 | 2021-01-12 20:15:07 | 0 | 7 | True | 41.3 | 18.4 | 86.0 | 73.1 | {1: True, 213: '0.7', 56: 'Center', 178: True,... | NaN | True | False | False | pass |
How-to Guides¶
Here you’ll find short answers to “How do I…?” types of questions. These how-to guides don’t cover topics in depth – you’ll find that material in the API Reference. However, these guides will help you quickly accomplish common tasks.
How to add custom leagues¶
SoccerData has built-in support to scrape data from the top-5 European leagues
and the major international tournaments. The leagues available for each source
can be listed with the available_leagues()
class method.
import soccerdata as sd
sd.FBref.available_leagues()
>>> ['ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga', 'ITA-Serie A']
This documentation explains how to add custom leagues.
Warning
Note that you might encounter errors when trying to scrape data for the leagues you added yourself. This is because the data provided for these leagues might have a different structure. If you encounter such an error, please do not open an issue on GitHub, but try to fix it yourself.
Adding a new league¶
Additional leagues can be configured in SOCCERDATA_DIR/config/league_dict.json.
This file should contain a mapping between a generic name for the league and
the identifier used internally by each data source (see below) that you want
to support. For example, for the Dutch Eredivisie this would be:
{
"NED-Eredivisie": {
"ClubElo": "NED_1",
"MatchHistory": "N1",
"SoFIFA": "[Netherlands] Eredivisie",
"FBref": "Eredivisie",
"ESPN": "ned.1",
"FiveThirtyEight": "eredivisie",
"WhoScored": "Netherlands - Eredivisie",
"Sofascore": "Eredivisie",
"season_start": "Aug",
"season_end": "May"
}
}
The season_start and season_end fields are optional. They should be the months
in which the first and last games of a season are played, respectively. If
they are not provided, July is used as the first month of the season and June
as the last one.
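If you prefer to generate the file from Python rather than edit it by hand, a minimal sketch could look as follows. It assumes a temporary directory as a stand-in for SOCCERDATA_DIR so that it does not touch a real configuration, and it lists only two of the data sources from the example above.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a temporary directory stands in for SOCCERDATA_DIR so the
# sketch does not touch a real configuration.
soccerdata_dir = Path(tempfile.mkdtemp())
config_file = soccerdata_dir / "config" / "league_dict.json"

league_dict = {
    "NED-Eredivisie": {
        "ClubElo": "NED_1",
        "FBref": "Eredivisie",
        # Optional: months of the first and last game of a season.
        "season_start": "Aug",
        "season_end": "May",
    }
}

config_file.parent.mkdir(parents=True, exist_ok=True)
config_file.write_text(json.dumps(league_dict, indent=2), encoding="utf-8")

# Read the file back to confirm it is valid JSON with the expected entry.
loaded = json.loads(config_file.read_text(encoding="utf-8"))
print(sorted(loaded))
```

Writing the file through json.dumps avoids the syntax errors that hand-edited JSON tends to pick up, such as missing or trailing commas.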
Now, restart your Python session and check whether it is added to available leagues by running the command below.
>>> import soccerdata as sd
>>> sd.FBref.available_leagues()
[..., 'NED-Eredivisie', ...]
Internal identifiers¶
Below are instructions on how to find the internal identifiers for each data source.
- ClubElo
The internal identifier has the format {country_code}_{level}. To get the country code, go to https://clubelo.com/, click on the league you want to add and take the three-letter code in the URL. For example, the URL for the Dutch Eredivisie is http://clubelo.com/NED, which means that the country code is NED. The level is the number of the league, starting with 1 for the top league. The internal identifier for the Dutch Eredivisie is therefore NED_1.
- MatchHistory
The internal identifier has the format {country_code}{level}. Download the CSV file corresponding to the league you would like to add from https://www.football-data.co.uk/data.php and take the value in the Div column.
- SoFIFA
The internal identifier has the format [{region}] {league name}. Go to https://sofifa.com/api/league to get the list of available leagues. The {region} corresponds to the nationName field in the JSON response. The {league name} corresponds to the value field.
- FBref
Go to https://fbref.com/en/comps/ and take the value in the Competition Name column.
- ESPN
The internal identifier has the format {country_code}.{level}. Go to https://www.espn.com/soccer/competitions, click on the league you want to add and take the value in the URL after /league/_/name/.
- FiveThirtyEight
Go to https://projects.fivethirtyeight.com/soccer-predictions/, select the relevant league and take the value in the URL after /soccer-predictions/.
- WhoScored
Go to https://www.whoscored.com and use the JavaScript console to get the value of the allRegions variable. The internal identifier has the format {region name} - {league name}.
- FotMob
The internal identifier is identical in style to the general format: [{region}]-{league name}. Go to https://www.fotmob.com/api/allLeagues to get the list of available leagues. The {region} corresponds to the ccode field in the JSON response. The {league name} corresponds to the name field.
- Sofascore
Go to https://api.sofascore.com/api/v1/config/unique-tournaments/EN/football to get the list of major leagues and tournaments. Access uniqueTournaments in the JSON response; the {league name} corresponds to the name field.
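Several of these identifiers follow a fixed pattern, so they can be assembled with small helpers. The functions below are hypothetical and only illustrate the formats listed above. The ESPN URL in the usage line is an assumed example of a league page address, not one taken from this guide.

```python
from urllib.parse import urlparse


def clubelo_id(country_code, level):
    """Build a ClubElo identifier such as 'NED_1'."""
    return f"{country_code}_{level}"


def sofifa_id(region, league_name):
    """Build a SoFIFA identifier such as '[Netherlands] Eredivisie'."""
    return f"[{region}] {league_name}"


def whoscored_id(region_name, league_name):
    """Build a WhoScored identifier such as 'Netherlands - Eredivisie'."""
    return f"{region_name} - {league_name}"


def espn_id_from_url(url):
    """Take the part of an ESPN league URL after '/league/_/name/'."""
    marker = "/league/_/name/"
    path = urlparse(url).path
    if marker not in path:
        raise ValueError(f"not an ESPN league URL: {url}")
    return path.split(marker, 1)[1].strip("/")


print(clubelo_id("NED", 1))
print(sofifa_id("Netherlands", "Eredivisie"))
print(whoscored_id("Netherlands", "Eredivisie"))
# Assumed example URL, used only to illustrate the extraction.
print(espn_id_from_url("https://www.espn.com/soccer/league/_/name/ned.1"))
```

The results match the values used for the Dutch Eredivisie in the league_dict.json example earlier in this guide.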
Troubleshooting¶
If you add a new league and it doesn’t show up in the list of available leagues, there are a few things you can do to debug the problem.
- Make sure to reload the soccerdata module after you modify the league_dict.json file. The most straightforward way to do this is to restart your notebook or Python interpreter.
- Check whether your league_dict.json file is at the correct location. If so, you should see this appear in the log messages when importing the soccerdata library.
>>> import soccerdata as sd
[11/25/22 11:49:12] INFO Custom team name replacements loaded from <path>/teamname_replacements.json. _config.py:83
INFO Custom league dict loaded from <path>/league_dict.json. _config.py:153
- Check whether the content of your league_dict.json file is valid JSON. You can check the file’s syntax using Python’s built-in json.tool module.
$ cat config/league_dict.json | python -m json.tool
Expecting ',' delimiter: line 1 column 10 (char 9)
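The same syntax check can be run from within Python, which is convenient in a notebook. This is a minimal sketch built on the standard json module. The helper check_league_dict is hypothetical, and the requirement that the top-level value is an object is an assumption based on the example file above.

```python
import json


def check_league_dict(text):
    """Return (True, None) for valid JSON or (False, message) otherwise."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        message = f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"
        return False, message
    # Assumption: the file must hold a mapping of league names to settings.
    if not isinstance(data, dict):
        return False, "top-level value must be a JSON object"
    return True, None


# A comma is missing between the two entries of the inner object.
broken = '{"NED-Eredivisie": {"ClubElo": "NED_1" "FBref": "Eredivisie"}}'
print(check_league_dict(broken))
print(check_league_dict('{"NED-Eredivisie": {"ClubElo": "NED_1"}}'))
```

To check a real file, pass the result of reading league_dict.json as text to the helper.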
How to use a proxy server¶
You can set up a SOCKS5 proxy with Tor.
Check out the installation guide on the Tor website for installation
instructions. After installing Tor, make sure to start it before scraping.
You can do this by running the tor command from your terminal (in
a separate window). Tor will start up and listen on “localhost:9050” by default.
Once Tor is running, you can route requests through it by setting proxy='tor'.
ws = sd.WhoScored(proxy='tor')
The code snippet above assumes you have a Tor proxy running on “localhost:9050”. Many distributions indeed default to having a SOCKS proxy listening on port 9050, but some may not. In particular, the Tor Browser Bundle defaults to listening on port 9150. You can specify a custom host and port as follows:
ws = sd.WhoScored(proxy={
"http": "socks5://127.0.0.1:9150",
"https": "socks5://127.0.0.1:9150",
})
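According to the scraper docstrings, the proxy parameter also accepts a list of proxy dicts or a callable that returns one, which allows rotating proxies after failed requests. The sketch below prepares both forms. The addresses are placeholders, and next_proxy is a hypothetical helper name.

```python
from itertools import cycle

# Placeholder proxy addresses; replace them with proxies you control.
PROXIES = [
    {"http": "socks5://127.0.0.1:9050", "https": "socks5://127.0.0.1:9050"},
    {"http": "socks5://127.0.0.1:9150", "https": "socks5://127.0.0.1:9150"},
]
_pool = cycle(PROXIES)


def next_proxy():
    """Return the next proxy mapping, cycling through PROXIES."""
    return next(_pool)


# Either form could then be passed to a scraper, for example:
#   ws = sd.WhoScored(proxy=PROXIES)     # list of proxies
#   ws = sd.WhoScored(proxy=next_proxy)  # callable
```

A callable is the more flexible option if you want to fetch fresh proxy addresses from an external service instead of cycling through a fixed list.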
Data Analysis Examples¶
Below is a gallery of examples of data analysis using the soccerdata
package.
If you have an example you would like to share, please submit a pull request to the
SoccerData GitHub repository.
[2]:
import soccerdata as sd
[3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("notebook")
sns.set_style("whitegrid")
Evolution of top team’s Elo ratings¶
How did the current top 5 teams in the world develop over time?
[4]:
elo = sd.ClubElo()
current_elo = elo.read_by_date()
current_elo.head()
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/clubelo.py:18: FutureWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
return pd.read_csv(
[4]:
team | rank | country | level | elo | from | to | league |
---|---|---|---|---|---|---|---|
Man City | 1.0 | ENG | 1 | 2079.984619 | 2023-06-11 | 2023-08-11 | ENG-Premier League |
Liverpool | 2.0 | ENG | 1 | 1949.932861 | 2023-06-11 | 2023-08-13 | ENG-Premier League |
Bayern | 3.0 | GER | 1 | 1937.012451 | 2023-05-28 | 2023-08-18 | GER-Bundesliga |
Arsenal | 4.0 | ENG | 1 | 1928.216187 | 2023-06-11 | 2023-08-12 | ENG-Premier League |
Real Madrid | 5.0 | ESP | 1 | 1907.589233 | 2023-06-08 | 2023-08-12 | ESP-La Liga |
[5]:
num_teams = 5
smoothing = 100
elo_top_development = pd.concat(
[elo.read_team_history(team)['elo'].rolling(smoothing).mean()
for team in current_elo.reset_index()['team'][:num_teams]
],
axis=1)
elo_top_development.columns = current_elo.reset_index()['team'][:num_teams]
# Forward-fill gaps; ffill() returns a new DataFrame, so assign the result
elo_top_development = elo_top_development.ffill()
fig = plt.figure(figsize=(16, 10))
ax1 = fig.add_subplot(111, ylabel='ELO rolling avg.', xlabel='Date')
elo_top_development.plot(ax=ax1)
ax1.legend(loc='upper left', frameon=False, bbox_to_anchor=(0, 1.05), ncol=num_teams)
sns.despine();
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/clubelo.py:18: FutureWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
return pd.read_csv(
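The smoothing in the cell above relies on pandas' rolling(smoothing).mean(), which with default settings leaves the first smoothing - 1 entries empty because the window is not yet full. The plain-Python sketch below mimics that behaviour on a handful of made-up ratings to show what the smoothing computes. It is an illustration only, not how pandas implements it.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` values; None until the window is full."""
    result = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        result.append(running_sum / window if i >= window - 1 else None)
    return result


# Made-up ratings, not real Elo data.
ratings = [1900.0, 1910.0, 1890.0, 1920.0, 1930.0]
smoothed = rolling_mean(ratings, window=3)
print(smoothed)
```

A larger window gives a smoother curve but reacts more slowly to a team's recent results.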

[2]:
import soccerdata as sd
[3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_context("notebook")
sns.set_style("whitegrid")
Home team advantage in the Italian Serie A¶
We all know sports teams have an advantage when playing at home. Here’s a look at home team advantage for 5 years of the Serie A.
[4]:
seriea_hist = sd.MatchHistory('ITA-Serie A', range(2018, 2023))
games = seriea_hist.read_games()
games.sample(5)
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/soccerdata/_common.py:471: UserWarning: Season id "2021" is ambiguous: interpreting as "20-21"
warnings.warn(msg, stacklevel=1)
[4]:
date | home_team | away_team | FTHG | FTAG | FTR | HTHG | HTAG | HTR | HS | ... | AvgC<2.5 | AHCh | B365CAHH | B365CAHA | PCAHH | PCAHA | MaxCAHH | MaxCAHA | AvgCAHH | AvgCAHA | |||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
league | season | game | |||||||||||||||||||||
ITA-Serie A | 1819 | 2018-12-22 Lazio-Cagliari | 2018-12-22 12:00:00 | Lazio | Cagliari | 3 | 1 | H | 2 | 0 | H | 15 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2021 | 2020-12-20 Torino-Bologna | 2020-12-20 11:30:00 | Torino | Bologna | 1 | 1 | D | 0 | 0 | D | 13 | ... | 2.25 | 0.0 | 1.91 | 2.02 | 1.92 | 2.0 | 1.94 | 2.11 | 1.86 | 2.00 | |
1819 | 2019-03-03 Udinese-Bologna | 2019-03-03 12:00:00 | Udinese | Bologna | 2 | 1 | H | 1 | 1 | D | 11 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
2021 | 2021-02-20 Genoa-Verona | 2021-02-20 17:00:00 | Genoa | Verona | 2 | 2 | D | 0 | 1 | A | 12 | ... | 1.64 | 0.0 | 1.95 | 1.98 | 1.93 | 2.0 | 2.07 | 2.00 | 1.93 | 1.93 | |
1819 | 2019-03-10 Fiorentina-Lazio | 2019-03-10 12:00:00 | Fiorentina | Lazio | 1 | 1 | D | 0 | 1 | A | 7 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
[5]:
def home_away_results(games: pd.DataFrame):
    """Returns aggregated home/away results per team"""
    res = (
        pd.melt(games.reset_index(),
                id_vars=['date', 'FTR'],
                value_name='team', var_name='is_home',
                value_vars=['home_team', 'away_team']))
    res.is_home = res.is_home.replace(['home_team', 'away_team'], ['Home', 'Away'])
    res['win'] = res['lose'] = res['draw'] = 0
    res.loc[(res['is_home'] == 'Home') & (res['FTR'] == 'H'), 'win'] = 1
    res.loc[(res['is_home'] == 'Away') & (res['FTR'] == 'A'), 'win'] = 1
    res.loc[(res['is_home'] == 'Home') & (res['FTR'] == 'A'), 'lose'] = 1
    res.loc[(res['is_home'] == 'Away') & (res['FTR'] == 'H'), 'lose'] = 1
    res.loc[res['FTR'] == 'D', 'draw'] = 1
    g = res.groupby(['team', 'is_home'])
    w = (g.win.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_win', 'mean': 'win_pct'}))
    l = (g.lose.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_lose', 'mean': 'lose_pct'}))
    d = (g.draw.agg(['sum', 'mean'])
         .rename(columns={'sum': 'n_draw', 'mean': 'draw_pct'}))
    res = pd.concat([w, l, d], axis=1)
    return res
[6]:
results = home_away_results(games)
results.head(6)
[6]:
team | is_home | n_win | win_pct | n_lose | lose_pct | n_draw | draw_pct |
---|---|---|---|---|---|---|---|
Atalanta | Away | 52 | 0.547368 | 16 | 0.168421 | 27 | 0.284211 |
Atalanta | Home | 56 | 0.589474 | 23 | 0.242105 | 16 | 0.168421 |
Benevento | Away | 10 | 0.263158 | 18 | 0.473684 | 10 | 0.263158 |
Benevento | Home | 4 | 0.105263 | 20 | 0.526316 | 14 | 0.368421 |
Bologna | Away | 22 | 0.231579 | 48 | 0.505263 | 25 | 0.263158 |
Bologna | Home | 35 | 0.368421 | 29 | 0.305263 | 31 | 0.326316 |
The overall picture shows most teams have a clear advantage at home:
[7]:
g = sns.FacetGrid(results.reset_index(), hue='team', palette='Set2', height=6, aspect=.5)
g.map(sns.pointplot, 'is_home', 'win_pct', order=["Away", "Home"])
g.set_axis_labels('', 'win %');
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/.venv/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

But there are a few exceptions.
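One way to make those exceptions explicit is to subtract each team's away win rate from its home win rate and look for negative differences. The plain-Python sketch below does this for the three teams shown in the results table above. With the full results dataframe, the same comparison could be made in pandas.

```python
# win_pct values copied from the first rows of the results table above.
win_pct = {
    "Atalanta": {"Home": 0.589474, "Away": 0.547368},
    "Benevento": {"Home": 0.105263, "Away": 0.263158},
    "Bologna": {"Home": 0.368421, "Away": 0.231579},
}

# Home advantage expressed as the home win rate minus the away win rate.
home_advantage = {
    team: round(pct["Home"] - pct["Away"], 6) for team, pct in win_pct.items()
}
# Teams that win more often away than at home.
exceptions = [team for team, diff in home_advantage.items() if diff < 0]

print(home_advantage)
print(exceptions)
```

Among these three teams, only Benevento won a larger share of its away games than of its home games.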
[8]:
g = sns.FacetGrid(results.reset_index(), col='team', col_wrap=5)
g.map(sns.pointplot, 'is_home', 'win_pct', order=["Away", "Home"])
g.set_axis_labels('', 'win %');
/cw/dtaijupiter/NoCsBack/dtai/pieterr/Projects/soccerdata/.venv/lib/python3.11/site-packages/seaborn/axisgrid.py:118: UserWarning: The figure layout has changed to tight
self._figure.tight_layout(*args, **kwargs)

API Reference¶
This part of the documentation covers all the interfaces of the implemented data scrapers.
Club Elo¶
- class soccerdata.ClubElo(proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/ClubElo'))¶
Provides pd.DataFrames from CSV API at http://api.clubelo.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/ClubElo.
Since the source does not provide league names, this class will not filter by league. League names will be inserted from the other sources where available. Leagues that are only covered by clubelo.com will have NaN values.
- Parameters:
proxy ('tor' or dict or list(dict) or callable, optional) – Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
- list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
- callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_by_date(date=None)¶
Retrieve ELO scores for all teams at specified date.
Elo scores are available as early as 1939. Values before 1960 should be considered provisional.
- Parameters:
date (datetime object or string like 'YYYY-MM-DD') – Date for which to retrieve ELO scores. If no date is specified, get today’s scores.
- Raises:
TypeError – If date is not a date string or datetime object.
ValueError – If date is an invalid date string.
- Return type:
pd.DataFrame
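To illustrate the documented contract of the date argument (a datetime object or a 'YYYY-MM-DD' string, with TypeError and ValueError for invalid input), here is a minimal sketch of such a check. The helper normalize_date is hypothetical and is not the library's implementation.

```python
from datetime import datetime


def normalize_date(value=None):
    """Return a datetime for None, a datetime, or a 'YYYY-MM-DD' string.

    Hypothetical helper for illustration; not part of soccerdata.
    """
    if value is None:
        # No date given: fall back to today.
        return datetime.today()
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        # strptime raises ValueError for strings that are not valid dates.
        return datetime.strptime(value, "%Y-%m-%d")
    raise TypeError("date must be a datetime object or a 'YYYY-MM-DD' string")


print(normalize_date("2023-08-11"))
```

Passing a string such as "2023-13-01" raises ValueError, and passing an integer raises TypeError, mirroring the two error cases listed above.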
- read_team_history(team, max_age=1)¶
Retrieve full ELO history for one club.
For the exact spelling of a club’s name, check the result of read_by_date() or clubelo.com. You can also use alternative team names specified in teamname_replacements.json. Values before 1960 should be considered provisional.
- Parameters:
team (str) – The club’s name.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
ValueError – If no ratings for the given team are available.
- Return type:
pd.DataFrame
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
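The interplay between filepath and max_age can be pictured as a freshness check on the cached file. The sketch below is an assumption about how such a check might work, written against the documented parameter types (an int for days, or a timedelta). The helper is_cache_fresh is hypothetical and does not reproduce the library's code.

```python
import os
import tempfile
import time
from datetime import timedelta
from pathlib import Path


def is_cache_fresh(filepath, max_age=None):
    """Decide whether a cached file can be reused (hypothetical helper)."""
    if max_age is not None and not isinstance(max_age, (int, timedelta)):
        raise TypeError("max_age must be an int (days) or a timedelta")
    path = Path(filepath)
    if not path.exists():
        return False
    if max_age is None:
        return True  # no age limit: any existing file counts as fresh
    if isinstance(max_age, int):
        max_age = timedelta(days=max_age)
    age_seconds = time.time() - path.stat().st_mtime
    return age_seconds <= max_age.total_seconds()


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "page.html"
    cached.write_text("cached content")
    # Pretend the file was downloaded two days ago.
    two_days_ago = time.time() - 2 * 24 * 3600
    os.utime(cached, (two_days_ago, two_days_ago))
    fresh_1d = is_cache_fresh(cached, max_age=1)
    fresh_3d = is_cache_fresh(cached, max_age=timedelta(days=3))
    print(fresh_1d, fresh_3d)
```

In this sketch a two-day-old file is stale under a one-day limit and still fresh under a three-day limit.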
- property leagues: List[str]¶
Return a list of selected leagues.
- property seasons: List[str]¶
Return a list of selected seasons.
ESPN¶
- class soccerdata.ESPN(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/ESPN'))¶
Provides pd.DataFrames from JSON API available at http://site.api.espn.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/ESPN.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) – Use a proxy to hide your IP address. Valid options are:
- "tor": Uses the Tor network. Tor should be running in the background on port 9050.
- dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
- list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
- callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- property seasons: List[str]¶
Return a list of selected seasons.
- read_schedule(force_cache=False)¶
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_matchsheet(match_id=None)¶
Retrieve match sheets for the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the match sheet for a specific game.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame.
- read_lineup(match_id=None)¶
Retrieve lineups for the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the lineup for a specific game.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame.
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
- property leagues: List[str]¶
Return a list of selected leagues.
FBref¶
- class soccerdata.FBref(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FBref'))¶
Provides pd.DataFrames from data at http://fbref.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FBref.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include. For efficiently reading data from the Top-5 European leagues, use “Big 5 European Leagues Combined”.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_leagues(split_up_big5=False)¶
Retrieve selected leagues from the datasource.
- Parameters:
split_up_big5 (bool) – If True, it will load the “Big 5 European Leagues Combined” instead of each league individually.
- Return type:
pd.DataFrame
- read_seasons(split_up_big5=False)¶
Retrieve the selected seasons for the selected leagues.
- Parameters:
split_up_big5 (bool) – If True, it will load the “Big 5 European Leagues Combined” instead of each league individually.
- Return type:
pd.DataFrame
- read_team_season_stats(stat_type='standard', opponent_stats=False)¶
Retrieve aggregated season stats for all teams in the selected leagues and seasons.
- The following stat types are available:
‘standard’
‘keeper’
‘keeper_adv’
‘shooting’
‘passing’
‘passing_types’
‘goal_shot_creation’
‘defense’
‘possession’
‘playing_time’
‘misc’
- Parameters:
stat_type (str) – Type of stats to retrieve.
opponent_stats (bool) – If True, will retrieve opponent stats.
- Raises:
ValueError – If stat_type is not valid.
- Return type:
pd.DataFrame
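As a rough sketch of how the stat types listed above might be used together, the helper below checks the requested names against the documented list and then calls `read_team_season_stats` once per type. The helper name `collect_team_season_stats` is hypothetical; `reader` is assumed to be an `sd.FBref` instance (or anything exposing the same method).

```python
# Stat types as listed in the documentation for read_team_season_stats.
TEAM_SEASON_STAT_TYPES = (
    "standard", "keeper", "keeper_adv", "shooting", "passing",
    "passing_types", "goal_shot_creation", "defense", "possession",
    "playing_time", "misc",
)


def collect_team_season_stats(reader, stat_types, opponent_stats=False):
    """Fetch several stat types and return them keyed by stat type."""
    unknown = [s for s in stat_types if s not in TEAM_SEASON_STAT_TYPES]
    if unknown:
        # Fail early, before any request is made.
        raise ValueError(f"Unknown stat type(s): {unknown}")
    return {
        s: reader.read_team_season_stats(stat_type=s, opponent_stats=opponent_stats)
        for s in stat_types
    }
```

With a real reader this would be called as, for example, `collect_team_season_stats(fbref, ["shooting", "passing"])`.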
- read_team_match_stats(stat_type='schedule', opponent_stats=False, team=None, force_cache=False)¶
Retrieve the match logs for all teams in the selected leagues and seasons.
- The following stat types are available:
‘schedule’
‘keeper’
‘shooting’
‘passing’
‘passing_types’
‘goal_shot_creation’
‘defense’
‘possession’
‘misc’
- Parameters:
stat_type (str) – Type of stats to retrieve.
opponent_stats (bool) – If True, will retrieve opponent stats.
team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If stat_type is not valid.
- Return type:
pd.DataFrame
- read_player_season_stats(stat_type='standard')¶
Retrieve players from the datasource for the selected leagues and seasons.
- The following stat types are available:
‘standard’
‘shooting’
‘passing’
‘passing_types’
‘goal_shot_creation’
‘defense’
‘possession’
‘playing_time’
‘misc’
‘keeper’
‘keeper_adv’
- Parameters:
stat_type (str) – Type of stats to retrieve.
- Raises:
TypeError – If stat_type is not valid.
- Return type:
pd.DataFrame
- read_schedule(force_cache=False)¶
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_player_match_stats(stat_type='summary', match_id=None, force_cache=False)¶
Retrieve the match stats for the selected leagues and seasons.
- The following stat types are available:
‘summary’
‘keepers’
‘passing’
‘passing_types’
‘defense’
‘possession’
‘misc’
- Parameters:
stat_type (str) – Type of stats to retrieve.
match_id (int or list of int, optional) – Retrieve the match stats for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
TypeError – If stat_type is not valid.
- Return type:
pd.DataFrame
- read_lineup(match_id=None, force_cache=False)¶
Retrieve lineups for the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the lineup for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame.
- read_events(match_id=None, force_cache=False)¶
Retrieve match events for the selected seasons or selected matches.
The returned data includes the timing of goals, cards and substitutions, as well as the players involved in each event.
- Parameters:
match_id (int or list of int, optional) – Retrieve the events for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame.
- read_shot_events(match_id=None, force_cache=False)¶
Retrieve shooting data for the selected seasons or selected matches.
The returned data includes who took the shot, when, with which body part and from how far away. It also includes the player who created the chance and the action that preceded that creation.
- Parameters:
match_id (int or list of int, optional) – Retrieve the shots for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame.
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
FiveThirtyEight¶
- class soccerdata.FiveThirtyEight(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FiveThirtyEight'))¶
Provides pd.DataFrames from FiveThirtyEight’s “Club Soccer Predictions” project.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FiveThirtyEight.
Original project and background info: https://projects.fivethirtyeight.com/soccer-predictions/ and https://fivethirtyeight.com/features/how-our-club-soccer-projections-work/
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_games()¶
Retrieve all games for the selected leagues.
- Return type:
pd.DataFrame
- read_forecasts()¶
Retrieve the forecasted results for the selected leagues.
- Return type:
pd.DataFrame
- read_clinches()¶
Retrieve clinches for the selected leagues.
- Return type:
pd.DataFrame
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
FotMob¶
- class soccerdata.FotMob(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/FotMob'))¶
Provides pd.DataFrames from data available at http://www.fotmob.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/FotMob.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
’tor’: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_leagues()¶
Retrieve the selected leagues from the datasource.
- Return type:
pd.DataFrame
- read_seasons()¶
Retrieve the selected seasons for the selected leagues.
- Return type:
pd.DataFrame
- read_league_table(force_cache=False)¶
Retrieve the league table for the selected leagues.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_schedule(force_cache=False)¶
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_team_match_stats(stat_type='Top stats', opponent_stats=True, team=None, force_cache=False)¶
Retrieve the match stats for the selected leagues and seasons.
- The following stat types are available:
‘Top stats’
‘Shots’
‘Expected goals (xG)’
‘Passes’
‘Defence’
‘Duels’
‘Discipline’
- Parameters:
stat_type (str) – Type of stats to retrieve.
opponent_stats (bool) – If True, will retrieve opponent stats.
team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
TypeError – If stat_type is not valid.
ValueError – If no games with the given IDs were found for the selected seasons and leagues.
- Return type:
pd.DataFrame
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
MatchHistory¶
- class soccerdata.MatchHistory(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/MatchHistory'))¶
Provides pd.DataFrames from CSV files available at http://www.football-data.co.uk/data.php.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/MatchHistory.
- Parameters:
leagues (string or iterable) – IDs of leagues to include.
seasons (string, int or list) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path, optional) – Path to directory where data will be cached.
- property seasons: List[str]¶
Return a list of selected seasons.
- read_games()¶
Retrieve game history for the selected leagues and seasons.
Column names are explained here: http://www.football-data.co.uk/notes.txt
- Return type:
pd.DataFrame
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
- property leagues: List[str]¶
Return a list of selected leagues.
Sofascore¶
- class soccerdata.Sofascore(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/Sofascore'))¶
Provides pd.DataFrames from data available at http://www.sofascore.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/Sofascore.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
’tor’: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_leagues()¶
Retrieve the selected leagues from the datasource.
- Return type:
pd.DataFrame
- read_seasons()¶
Retrieve the selected seasons for the selected leagues.
- Return type:
pd.DataFrame
- read_league_table(force_cache=False)¶
Retrieve the league table for the selected leagues.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_schedule(force_cache=False)¶
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
SoFIFA¶
- class soccerdata.SoFIFA(leagues=None, versions='latest', proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/SoFIFA'))¶
Provides pd.DataFrames from data at http://sofifa.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/SoFIFA.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
versions (string, int or list of int, optional) – FIFA releases to include. Should be specified by their ID used in the URL (e.g., 230034). Alternatively, the string “all” can be used to include all versions and “latest” to include the latest version only. Defaults to “latest”.
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
- read_leagues()¶
Retrieve selected leagues from the datasource.
- Return type:
pd.DataFrame
- read_versions(max_age=1)¶
Retrieve available FIFA releases and rating updates.
- Parameters:
max_age (int for age in days, or timedelta object) – The max. age of the locally cached release history before a new version is downloaded.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Return type:
pd.DataFrame
- read_teams()¶
Retrieve all teams for the selected leagues.
- Return type:
pd.DataFrame
- read_players(team=None)¶
Retrieve all players for the selected leagues.
- Parameters:
team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.
- Raises:
ValueError – If no data is found for the given team(s) in the selected leagues.
- Return type:
pd.DataFrame
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- read_team_ratings()¶
Retrieve ratings for all teams in the selected leagues.
- Return type:
pd.DataFrame
- read_player_ratings(team=None, player=None)¶
Retrieve ratings for players.
- Parameters:
team (str or list of str, optional) – Team(s) to retrieve. If None, will retrieve all teams.
player (int or list of int, optional) – Player(s) to retrieve. If None, will retrieve all players.
- Return type:
pd.DataFrame
Understat¶
- class soccerdata.Understat(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/Understat'))¶
Provides pd.DataFrames from data at https://understat.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/Understat.
- Parameters:
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
leagues (str | List[str] | None)
seasons (str | int | Iterable[str | int] | None)
- property seasons: List[str]¶
Return a list of selected seasons.
- read_leagues()¶
Retrieve the selected leagues from the datasource.
- Return type:
pd.DataFrame
- read_seasons()¶
Retrieve the selected seasons from the datasource.
- Return type:
pd.DataFrame
- read_schedule(include_matches_without_data=True, force_cache=False)¶
Retrieve the matches for the selected leagues and seasons.
- Parameters:
include_matches_without_data (bool) – By default matches with and without data are returned. If False, will only return matches with data.
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_team_match_stats(force_cache=False)¶
Retrieve the team match stats for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_player_season_stats(force_cache=False)¶
Retrieve the player season stats for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_player_match_stats(match_id=None)¶
Retrieve the player match stats for the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the player match stats for a specific match.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
- Return type:
pd.DataFrame
- read_shot_events(match_id=None)¶
Retrieve the shot events for the selected matches or the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the shot events for a specific match.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
- Return type:
pd.DataFrame
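Because an unknown match_id raises a ValueError, a script that loops over many IDs may want to record the misses instead of stopping. The sketch below shows one way to do that; `read_available` is a hypothetical helper, and `read_fn` is assumed to be a bound method such as `understat.read_shot_events`.

```python
def read_available(read_fn, match_ids):
    """Call read_fn for each match ID, collecting results and missing IDs."""
    frames = {}
    missing = []
    for match_id in match_ids:
        try:
            frames[match_id] = read_fn(match_id=match_id)
        except ValueError:
            # The ID was not found in the selected seasons; remember it and go on.
            missing.append(match_id)
    return frames, missing
```

Calling the reader once per ID trades some efficiency for the ability to tell exactly which IDs failed.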
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
- property leagues: List[str]¶
Return a list of selected leagues.
WhoScored¶
- class soccerdata.WhoScored(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/WhoScored'), path_to_browser=None, headless=False)¶
Provides pd.DataFrames from data available at http://whoscored.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/WhoScored.
- Parameters:
leagues (string or iterable, optional) – IDs of leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
”tor”: Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
path_to_browser (Path, optional) – Path to the Chrome executable.
headless (bool, default: False) – If True, will run Chrome in headless mode. Setting this to False might help to avoid getting blocked. Only supported for Selenium <4.13.
- read_schedule(force_cache=False)¶
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_missing_players(match_id=None, force_cache=False)¶
Retrieve a list of injured and suspended players ahead of each game.
- Parameters:
match_id (int or list of int, optional) – Retrieve the missing players for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
- Return type:
pd.DataFrame
- read_events(match_id=None, force_cache=False, live=False, output_fmt='events')¶
Retrieve the event data for each game in the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the event stream for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
live (bool) – If True, will not return a cached copy of the event data. This is useful to scrape live data.
output_fmt (str, default: 'events') –
- The output format of the returned data. Possible values are:
’events’ (default): Returns a dataframe with all events.
’raw’: Returns the original unformatted WhoScored JSON.
’spadl’: Returns a dataframe with the SPADL representation of the original events. See https://socceraction.readthedocs.io/en/latest/documentation/SPADL.html#spadl
’atomic-spadl’: Returns a dataframe with the Atomic-SPADL representation of the original events. See https://socceraction.readthedocs.io/en/latest/documentation/SPADL.html#atomic-spadl
’loader’: Returns a socceraction.data.opta.OptaLoader instance, which can be used to retrieve the actual data. See https://socceraction.readthedocs.io/en/latest/modules/generated/socceraction.data.opta.OptaLoader.html#socceraction.data.opta.OptaLoader
None: Doesn’t return any data. This is useful to just cache the data without storing the events in memory.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
ImportError – If the requested output format is ‘spadl’, ‘atomic-spadl’ or ‘loader’ but the socceraction package is not installed.
- Return type:
See the description of the output_fmt parameter.
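Since the 'spadl', 'atomic-spadl' and 'loader' formats raise an ImportError when socceraction is missing, a script can check for the package up front and fall back to a format that does not need it. The helper below is a hypothetical sketch of that check and is not part of soccerdata.

```python
import importlib.util

# Output formats that, per the documentation, need the socceraction package.
SOCCERACTION_FORMATS = {"spadl", "atomic-spadl", "loader"}


def choose_output_fmt(preferred="spadl", fallback="events"):
    """Return `preferred`, or `fallback` if it needs socceraction and that is absent."""
    needs_socceraction = preferred in SOCCERACTION_FORMATS
    if needs_socceraction and importlib.util.find_spec("socceraction") is None:
        return fallback
    return preferred
```

The result could then be passed as `output_fmt=choose_output_fmt("spadl")` to `read_events`.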
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
If you would like to extend the functionality of soccerdata, you might also be interested in the following modules:
Base Readers¶
The logic for downloading data from the web is implemented in the base classes that are documented here. The base classes are not intended to be used directly but rather to be subclassed by the specific readers which implement the logic to parse the data.
The BaseRequestsReader is a wrapper around the requests library and is used by scrapers that do not require JavaScript to be executed. The BaseSeleniumReader is a wrapper around the selenium library and is used by scrapers that require JavaScript to be executed.
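The division of labour described above (the base class downloads, the subclass parses) can be illustrated with a toy example. Nothing here imports soccerdata: `StandInReader` is a made-up stand-in whose `get` serves bytes from memory, the URL is hypothetical, and the subclass returns a list of dicts rather than a DataFrame to keep the sketch dependency-free.

```python
import csv
import io


class StandInReader:
    """Stand-in for a base reader: `get` returns a file-like object of bytes."""

    def __init__(self, pages):
        self._pages = pages  # maps URL -> raw bytes, replacing real downloads

    def get(self, url, filepath=None):
        return io.BytesIO(self._pages[url])


class ToyLeagueReader(StandInReader):
    """Subclass that only implements the parsing logic."""

    URL = "https://example.com/games.csv"  # hypothetical URL

    def read_games(self):
        raw = self.get(self.URL)
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


reader = ToyLeagueReader(
    {ToyLeagueReader.URL: b"date,home_team,away_team\n2018-08-10,Manchester United,Leicester City\n"}
)
games = reader.read_games()
```

A real reader would subclass BaseRequestsReader or BaseSeleniumReader instead and inherit their download and caching behaviour.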
- class soccerdata._common.BaseRequestsReader(leagues=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data'))¶
Base class for readers that use the Python requests module.
- Parameters:
leagues (str | List[str] | None)
proxy (str | Dict[str, str] | List[Dict[str, str]] | Callable[[], Dict[str, str]] | None)
no_cache (bool)
no_store (bool)
data_dir (Path)
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The max. age of locally cached file before re-download.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
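The caching rules described for `get` can be summarised in a small decision function. This is a simplified sketch under stated assumptions, not the library's implementation: the function name `cache_is_usable` and its `now` parameter are invented for the example, file age is taken from the modification time, and rejecting booleans for max_age is a choice made here.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def cache_is_usable(filepath, max_age=None, no_cache=False, now=None):
    """Return True if a cached file may be used instead of downloading."""
    if max_age is not None:
        # bool is a subclass of int, so it is rejected explicitly here.
        if isinstance(max_age, bool) or not isinstance(max_age, (int, timedelta)):
            raise TypeError("max_age must be an int (days) or a timedelta")
        if isinstance(max_age, int):
            max_age = timedelta(days=max_age)
    if no_cache or filepath is None or not Path(filepath).exists():
        return False
    if max_age is None:
        return True
    now = now or datetime.now()
    modified = datetime.fromtimestamp(Path(filepath).stat().st_mtime)
    return now - modified <= max_age


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "page.html"
    before_download = cache_is_usable(cached)  # file does not exist yet
    cached.write_text("<html></html>")
    fresh = cache_is_usable(cached, max_age=1)  # written a moment ago
    in_three_days = datetime.now() + timedelta(days=3)
    stale = cache_is_usable(cached, max_age=1, now=in_three_days)
    bypassed = cache_is_usable(cached, no_cache=True)
```

In this sketch `before_download`, `stale` and `bypassed` are False, while `fresh` is True.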
- property leagues: List[str]¶
Return a list of selected leagues.
- property seasons: List[str]¶
Return a list of selected seasons.
- class soccerdata._common.BaseSeleniumReader(leagues=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data'), path_to_browser=None, headless=True)¶
Base class for readers that use Selenium.
- Parameters:
leagues (str | List[str] | None)
proxy (str | Dict[str, str] | List[Dict[str, str]] | Callable[[], Dict[str, str]] | None)
no_cache (bool)
no_store (bool)
data_dir (Path)
path_to_browser (Path | None)
headless (bool)
- classmethod available_leagues()¶
Return a list of league IDs available for this source.
- Return type:
List[str]
- get(url, filepath=None, max_age=None, no_cache=False, var=None)¶
Load data from url.
By default, the source of url is downloaded and saved to filepath. If filepath exists, the url is not visited and the cached data is returned.
- Parameters:
url (str) – URL to download.
filepath (Path, optional) – Path to save downloaded file. If None, downloaded data is not cached.
max_age (int for age in days, or timedelta object) – The maximum age of the locally cached file before it is downloaded again.
no_cache (bool) – If True, will not use cached data. Overrides the class property.
var (str or list of str, optional) – Return a JavaScript variable instead of the page source.
- Raises:
TypeError – If max_age is not an integer or timedelta object.
- Returns:
File-like object of downloaded data.
- Return type:
io.BufferedIOBase
- property leagues: List[str]¶
Return a list of selected leagues.
- property seasons: List[str]¶
Return a list of selected seasons.
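The caching behaviour described for get can be pictured with a small, self-contained sketch. This is not the library's implementation: cached_get and its download argument are hypothetical names introduced here so that the example runs without network access. It only illustrates the documented rules: an existing filepath is returned without visiting the URL, max_age is an int (days) or a timedelta, and anything else raises TypeError.

```python
import io
import tempfile
from datetime import datetime, timedelta
from pathlib import Path
from typing import Callable, Optional, Union


def cached_get(
    url: str,
    filepath: Optional[Path] = None,
    max_age: Optional[Union[int, timedelta]] = None,
    no_cache: bool = False,
    *,
    download: Callable[[str], bytes],
) -> io.BufferedIOBase:
    """Hypothetical stand-in for a cache-or-download lookup."""
    # An int is interpreted as a number of days; other types are rejected.
    if isinstance(max_age, int) and not isinstance(max_age, bool):
        max_age = timedelta(days=max_age)
    elif max_age is not None and not isinstance(max_age, timedelta):
        raise TypeError('max_age must be an int (days) or a timedelta')

    if filepath is not None and filepath.exists() and not no_cache:
        modified = datetime.fromtimestamp(filepath.stat().st_mtime)
        if max_age is None or datetime.now() - modified <= max_age:
            # Cache hit: the URL is not visited.
            return io.BytesIO(filepath.read_bytes())

    # Cache miss, stale file, or caching disabled: fetch the data.
    payload = download(url)
    if filepath is not None:
        filepath.parent.mkdir(parents=True, exist_ok=True)
        filepath.write_bytes(payload)
    return io.BytesIO(payload)


# Usage with a fake downloader that records every call.
calls = []


def fake_download(url: str) -> bytes:
    calls.append(url)
    return b'<html>fixtures</html>'


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'fixtures.html'
    first = cached_get('https://example.com/fixtures', target, download=fake_download)
    second = cached_get('https://example.com/fixtures', target, max_age=1, download=fake_download)
    n_downloads = len(calls)
```

In this sketch the second call is served from the file written by the first one, so the fake downloader is invoked only once.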
Utilities¶
- _common.season_code(season)¶
Convert a string or int to a season code like ‘1718’.
- Parameters:
season (str | int)
- Return type:
str
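To give a feel for what such a conversion involves, here is a simplified sketch. The function to_season_code is a hypothetical helper written for this example, and the set of input formats it accepts is an assumption; the library's own function may accept other formats or resolve ambiguous inputs differently.

```python
import re
from typing import Union


def to_season_code(season: Union[str, int]) -> str:
    """Hypothetical helper: normalise a season label to a code like '1718'."""
    text = str(season).strip()

    # Two years separated by '-' or '/', e.g. '2017-18' or '2017/2018'.
    match = re.fullmatch(r'(\d{2}|\d{4})[-/](\d{2}|\d{4})', text)
    if match:
        start, end = match.groups()
        return start[-2:] + end[-2:]

    # Four digits that already look like a code, e.g. '1718'.
    # Note that '2021' is ambiguous; this sketch reads it as 2020/21.
    if re.fullmatch(r'\d{4}', text) and (int(text[:2]) + 1) % 100 == int(text[2:]):
        return text

    # A single starting year, e.g. 2017 or '17'.
    if re.fullmatch(r'\d{2}|\d{4}', text):
        start = int(text[-2:])
        return f'{start:02d}{(start + 1) % 100:02d}'

    raise ValueError(f'Unrecognised season: {season!r}')


codes = [to_season_code(s) for s in ('2017-18', '2017/2018', 2017, '1718', 1999)]
```

With these inputs, codes holds '1718' four times followed by '9900'.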
- _common.make_game_id(row)¶
Return a game id based on date, home and away team.
- Parameters:
row (Series)
- Return type:
str
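As an illustration only, an identifier of this kind could be assembled as follows. The helper build_game_id, the key names 'date', 'home_team' and 'away_team', and the resulting layout are all assumptions made for this sketch rather than the library's documented format. A plain dict stands in for the row so that the example has no dependencies.

```python
from datetime import date
from typing import Any, Mapping


def build_game_id(row: Mapping[str, Any]) -> str:
    """Hypothetical helper: derive an id from date, home and away team."""
    # Key names and the '<date> <home>-<away>' layout are assumptions.
    day = row['date'].strftime('%Y-%m-%d')
    return '{} {}-{}'.format(day, row['home_team'], row['away_team'])


game = {'date': date(2019, 5, 12), 'home_team': 'Burnley', 'away_team': 'Arsenal'}
game_id = build_game_id(game)
```

Here game_id evaluates to '2019-05-12 Burnley-Arsenal'.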
- _common.standardize_colnames(df, cols=None)¶
Convert DataFrame column names to snake case.
- Parameters:
df (DataFrame)
cols (List[str] | None)
- Return type:
DataFrame
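The snake-case conversion itself can be sketched with a few regular expressions. The helper to_snake_case below is hypothetical and works on plain strings so that the example has no dependencies; the exact rules the library applies may differ.

```python
import re
from typing import List


def to_snake_case(name: str) -> str:
    """Hypothetical helper: convert one column label to snake case."""
    # Insert an underscore before a capitalised word: 'HomeTeam' -> 'Home_Team'.
    name = re.sub(r'(.)([A-Z][a-z]+)', r'\1_\2', name.strip())
    # Split a lower-case letter or digit from a following capital: 'mI' -> 'm_I'.
    name = re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name)
    # Replace runs of spaces and hyphens, then collapse repeated underscores.
    name = re.sub(r'[\s\-]+', '_', name)
    return re.sub(r'_+', '_', name).lower()


columns: List[str] = ['HomeTeam', 'Away Team', 'homeTeamID', 'Shots-on-Target', 'date']
snake_columns = [to_snake_case(col) for col in columns]
```

After running this, snake_columns is ['home_team', 'away_team', 'home_team_id', 'shots_on_target', 'date']. With pandas, the same helper could be passed to DataFrame.rename(columns=to_snake_case).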
- _common.get_proxy()¶
Return a public proxy.
- Return type:
Dict[str, str]
- _common.check_proxy(proxy)¶
Check if proxy is working.
- Parameters:
proxy (dict)
- Return type:
bool
FAQ¶
Is web scraping legal?
Even though web scraping is ubiquitous, its legal status remains unclear, because whether it is legal depends on many factors. It is always best to consult with a lawyer or legal expert to ensure that your web scraping activities are legal and comply with all applicable laws and regulations.
Something doesn’t work
Have you updated to the newest version of soccerdata?
Clear the cache or run your script without caching enabled.
Does the log produce any warnings that sound like they might be related? Maybe the data you are looking for is not available or cannot be processed correctly.
Open an issue on GitHub.
Contributor Guide¶
This document lays out guidelines and advice for contributing to this project. If you’re thinking of contributing, please start by reading this document and getting a feel for how contributing to this project works. If you have any questions, feel free to reach out to Pieter Robberechts, the primary maintainer.
The guide is split into sections based on the type of contribution you’re thinking of making.
Bug Reports¶
Bug reports are hugely important! Before you raise one, though, please check through the GitHub issues, both open and closed, to confirm that the bug hasn’t been reported before.
When filing an issue, make sure to answer these questions:
Which Python version are you using?
Which version of soccerdata are you using?
What did you do?
What did you expect to see?
What did you see instead?
The best way to get your bug fixed is to provide a test case, and/or steps to reproduce the issue.
Feature Requests¶
If you believe there is a feature missing, feel free to raise a feature request on the Issue Tracker.
Documentation Contributions¶
Documentation improvements are always welcome! The documentation files live in the docs/ directory of the codebase. They’re written in reStructuredText, and use Sphinx to generate the full suite of documentation.
You do not have to set up a development environment to make small changes to the docs. Instead, you can edit files directly on GitHub and suggest changes.
When contributing documentation, please do your best to follow the style of the documentation files. This means a soft-limit of 79 characters wide in your text files and a semi-formal, yet friendly and approachable, prose style.
When presenting Python code, use single-quoted strings ('hello' instead of "hello").
Code Contributions¶
If you intend to contribute code, do not feel the need to sit on your contribution until it is perfectly polished and complete. It helps everyone involved for you to seek feedback as early as you possibly can. Submitting an early, unfinished version of your contribution for feedback can save you from putting a lot of work into a contribution that is not suitable for the project.
Setting up your development environment¶
You need Python 3.7.1+ and the tools used by the commands in this guide, Poetry and Nox.
Install the package with development requirements:
$ poetry install
You can now run an interactive Python session.
$ poetry run python
Steps for submitting Code¶
When contributing code, you’ll want to follow this checklist:
Fork the repository on GitHub.
Run the tests to confirm they all pass on your system. If they don’t, you’ll need to investigate why they fail. If you’re unable to diagnose this yourself, raise it as a bug report.
Write tests that demonstrate your bug or feature. Ensure that they fail.
Make your change.
Run the entire test suite again, confirming that all tests pass including the ones you just added.
Make sure your code follows the code style discussed below.
Send a GitHub Pull Request to the main repository’s master branch. GitHub Pull Requests are the expected method of code collaboration on this project.
Testing the project¶
Run the full test suite:
$ nox
List the available Nox sessions:
$ nox --list-sessions
You can also run a specific Nox session. For example, invoke the unit test suite like this:
$ nox --session=tests
Unit tests are located in the tests directory, and are written using the pytest testing framework.
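For contributors who have not used pytest before, a test module is simply a file of functions whose names start with test_ and that use plain assert statements. The sketch below shows the shape of such a module; the file name and the function under test, points_for_result, are hypothetical and are not part of soccerdata.

```python
# tests/test_points.py (hypothetical file name)


def points_for_result(goals_for: int, goals_against: int) -> int:
    """Hypothetical function under test: league points for one match."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def test_win_earns_three_points():
    assert points_for_result(2, 1) == 3


def test_draw_earns_one_point():
    assert points_for_result(0, 0) == 1


def test_loss_earns_no_points():
    assert points_for_result(1, 4) == 0
```

In a real contribution the function under test would be imported from the package rather than defined in the test file.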
Code style¶
The soccerdata codebase uses the PEP 8 code style. In addition, we have a few guidelines:
Line-length can exceed 79 characters, to 100, when convenient.
Line-length can exceed 100 characters, when doing otherwise would be terribly inconvenient.
Always use single-quoted strings (e.g. '#soccer'), unless a single-quote occurs within the string.
To ensure all code conforms to this format, you can format the code using the pre-commit hooks.
$ nox --session=pre-commit
Docstrings are to follow the numpydoc guidelines.
Submitting changes¶
Open a pull request to submit changes to this project.
Your pull request needs to meet the following guidelines for acceptance:
The Nox test suite must pass without errors and warnings.
Include unit tests.
If your changes add functionality, update the documentation accordingly.
Feel free to submit early, though. We can always iterate on this.
To run linting and code formatting checks before committing your change, you can install pre-commit as a Git hook by running the following command:
$ nox --session=pre-commit -- install
It is recommended to open an issue before starting work on anything.
Apache License¶
Copyright (c) 2021 Pieter Robberechts
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This file incorporates code of the footballdata software package covered by the following copyright and permission notice:
Copyright (c) 2017 skagr
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.