Getting Started¶
New to soccerdata? Well, you came to the right place: this tutorial will walk you through installing, configuring, and using the library. By the end of this tutorial, you will be able to scrape data from the top-5 European leagues and use it to create your own data-driven analyses.
Installation¶
SoccerData can be easily installed via pip:
python3 -m pip install soccerdata
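If you want to verify that the installation succeeded, you can check the installed version from Python. A minimal check using only the standard library:
# Confirm the package is installed and print its version
from importlib.metadata import version
print(version("soccerdata"))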
Scraping data¶
Each of the supported data sources has a corresponding class for fetching data with a uniform API. For example, the FBref class is used to fetch data from fbref.com.
import soccerdata as sd
# Create scraper class instance
fbref = sd.FBref()
Once you have a scraper class instance, you can use it to fetch data. See the examples and API reference for the full list of options available for each scraper. For example, to fetch aggregated shooting stats for all teams:
# Create dataframes
season_stats = fbref.read_team_season_stats(stat_type='shooting')
The data is always returned as a convenient Pandas DataFrame.
| league | season | team | #Pl | 90s | Gls | Sh | SoT | SoT% | Sh/90 | SoT/90 | G/Sh | G/SoT | Dist | FK | PK | PKatt | xG | npxG | npxG/Sh | G-xG | np:G-xG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENG-Premier League | 2021 | Arsenal | 29 | 38.0 | 53 | 455 | 141 | 31.0 | 11.97 | 3.71 | 0.1 | 0.33 | 16.9 | 23 | 6 | 6 | 53.5 | 49.0 | 0.11 | -0.5 | -2.0 |
| | | Aston Villa | 24 | 38.0 | 52 | 518 | 179 | 34.6 | 13.63 | 4.71 | 0.09 | 0.26 | 16.5 | 15 | 5 | 6 | 52.9 | 48.5 | 0.1 | -0.9 | -1.5 |
| | | Brighton | 27 | 38.0 | 39 | 476 | 129 | 27.1 | 12.53 | 3.39 | 0.07 | 0.26 | 16.6 | 14 | 6 | 9 | 51.6 | 44.8 | 0.1 | -12.6 | -11.8 |
| | | Burnley | 25 | 38.0 | 32 | 383 | 125 | 32.6 | 10.08 | 3.29 | 0.08 | 0.23 | 16.6 | 15 | 3 | 3 | 39.9 | 37.6 | 0.1 | -7.9 | -8.6 |
| | | Chelsea | 27 | 38.0 | 56 | 553 | 194 | 35.1 | 14.55 | 5.11 | 0.09 | 0.25 | 16.3 | 16 | 8 | 10 | 64.0 | 56.4 | 0.1 | -8.0 | -8.4 |
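Because the result is a plain DataFrame, indexed by league, season, and team, all the usual pandas operations apply. As a quick sketch, continuing from the example above:
# Flatten the (league, season, team) index into regular columns
df = season_stats.reset_index()
# Rank teams by expected goals (xG), highest first
print(df.sort_values("xG", ascending=False)[["team", "Gls", "xG"]].head())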
By default, the data for all available leagues and the five most recent seasons will be retrieved. In most cases, however, you will want to limit the data to specific leagues and/or seasons. This can be done by passing a list of leagues and seasons to the constructor of the scraper class. For example:
# Create scraper class instance filtering on specific leagues and seasons
fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])
# Retrieve data for the specified leagues and seasons
season_stats = fbref.read_team_season_stats(stat_type='shooting')
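The leagues and seasons passed to the constructor apply to every read method of the instance. For example, to fetch the match schedule for the same selection:
import soccerdata as sd

fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['1718', '1819'])
# The league/season filter set in the constructor applies here as well
schedule = fbref.read_schedule()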
Note that only a limited number of leagues are supported out of the box. The leagues available for each source can be listed with the available_leagues() class method.
>>> sd.FBref.available_leagues()
['Big 5 European Leagues Combined', 'ENG-Premier League', 'ESP-La Liga', 'FRA-Ligue 1', 'GER-Bundesliga', 'ITA-Serie A', 'INT-World Cup', "INT-Women's World Cup"]
You can add more leagues, but there is no guarantee that they will be scraped correctly.
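Because every scraper class exposes the same interface, you can inspect the supported leagues of several sources in one loop. A small sketch (assuming the listed scraper classes are available in your version of the library):
import soccerdata as sd

# available_leagues() is a class method, so no scraper instance is needed
for scraper in (sd.FBref, sd.MatchHistory, sd.WhoScored):
    print(scraper.__name__, scraper.available_leagues())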
Data caching¶
Data caching is used to speed up the runtime and to prevent exceeding the rate limit of web servers. By default, all downloaded data is cached to ~/soccerdata on Linux and Mac OS, and to C:\Users\yourusername\soccerdata on Windows. A custom location can be set if desired. You can configure this using environment variables (see below) or at the level of an individual scraper by setting the data_dir parameter when creating the scraper class instance:
# Create scraper class instance with custom caching directory
fbref = sd.FBref(data_dir="/tmp/FBref")
This directory can be deleted at any time to reclaim disk space. However, this also means the same data will have to be downloaded again the next time you need it, which reduces performance.
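If you prefer to clear the cache from Python, here is a minimal sketch. It assumes the default layout, in which each source caches its data in a subdirectory of ~/soccerdata/data; adjust the path if you use a custom data_dir:
import shutil
from pathlib import Path

# Delete the cached FBref data; it will be re-downloaded on the next request
cache_dir = Path.home() / "soccerdata" / "data" / "FBref"
shutil.rmtree(cache_dir, ignore_errors=True)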
SoccerData has no knowledge of when the data on the server changes, so it is up to the user to decide when to refresh the cache. This can be done by deleting the cache directory or by setting the no_cache option to True when creating the scraper class instance:
# Create scraper class instance which always re-downloads the latest data
fbref = sd.FBref(no_cache=True)
Some methods will assume the cache is always out-of-date (for example, when scraping the fixtures of the current season). Typically, these methods have a force_cache option that can be set to True to force the cached data to be used. For example:
fbref = sd.FBref(leagues=['ENG-Premier League'], seasons=['2324'])
fbref.read_schedule(force_cache=True)
Caching can also be disabled entirely by setting the no_store option to True when creating the scraper class instance. However, it should almost always be left enabled.
# Create scraper class instance with caching disabled
fbref = sd.FBref(no_store=True)
Global configuration¶
Several settings can be configured globally using the following environment variables:
SOCCERDATA_DIR
    The directory where the downloaded data is cached and where logs are stored. By default, all data is stored to ~/soccerdata on Linux / Mac OS and C:\Users\yourusername\soccerdata on Windows.
SOCCERDATA_NOCACHE
    If set to “true”, no cached data is returned. Note that no-cache does not mean “don’t cache”. All downloaded data is still cached and overwrites existing caches. If the sense of “don’t cache” that you want is actually “don’t store”, then SOCCERDATA_NOSTORE is the option to use. By default, data is retrieved from the cache.
SOCCERDATA_NOSTORE
    If set to “true”, no data is stored. By default, data is cached.
SOCCERDATA_MAXAGE
    The maximum age of cached data in seconds. If the cached data is older than this, it will be re-downloaded. By default, this is set to infinity.
SOCCERDATA_LOGLEVEL
    The level of logging to use. By default, this is set to “INFO”.
Example:
# bash
export SOCCERDATA_DIR="~/soccerdata"
export SOCCERDATA_NOCACHE="False"
export SOCCERDATA_NOSTORE="False"
export SOCCERDATA_LOGLEVEL="INFO"
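The same settings can also be applied from Python by setting the environment variables before soccerdata is first imported, since the configuration is read when the package loads:
import os

# These must be set before the first `import soccerdata`
os.environ["SOCCERDATA_DIR"] = "/tmp/soccerdata"
os.environ["SOCCERDATA_LOGLEVEL"] = "DEBUG"

import soccerdata as sd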
Uniform team names¶
Each data source uses a different set of team names, which makes it difficult to combine data from multiple sources. To mitigate this, SoccerData allows translating the team names to uniform names. This is done by providing a SOCCERDATA_DIR/config/teamname_replacements.json file. This file should contain a mapping between a generic name for each team and the team name used by each data source that you want to support. The example below will map “Tottenham Hotspur”, “Tottenham Hotspur FC” and “Spurs” to “Tottenham” in all scraped data.
{
    "Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]
}
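You can create this file by hand, or generate it from Python. A minimal sketch, assuming the default SOCCERDATA_DIR location:
import json
from pathlib import Path

# Write the mapping to SOCCERDATA_DIR/config/teamname_replacements.json
config_dir = Path.home() / "soccerdata" / "config"
config_dir.mkdir(parents=True, exist_ok=True)
replacements = {"Tottenham": ["Tottenham Hotspur", "Tottenham Hotspur FC", "Spurs"]}
with open(config_dir / "teamname_replacements.json", "w", encoding="utf-8") as f:
    json.dump(replacements, f, indent=2)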
Additional setup for scraping WhoScored data¶
WhoScored implements strong protection against scraping using Incapsula. To circumvent this, the scraper uses Selenium with ChromeDriver to emulate a real user. Before using this scraper, you will have to install Chrome. A Selenium driver matching your Chrome version will be downloaded automatically when you run the scraper.
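Once Chrome is installed, the WhoScored scraper is created like any other. A minimal sketch (expect it to be slower than the HTTP-based scrapers, since every page is rendered in a real browser):
import soccerdata as sd

# WhoScored drives a Chrome browser via Selenium under the hood
ws = sd.WhoScored(leagues=['ENG-Premier League'], seasons=['2122'])
schedule = ws.read_schedule()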
Next steps¶
Look at you! You’re now basically an expert at SoccerData! ✨
From this point you can:
Look at the example notebooks for each Data source.
Take a deep dive into the API.
Give us feedback or contribute, see Contributing.
Have fun! 🎉