WhoScored
- class soccerdata.WhoScored(leagues=None, seasons=None, proxy=None, no_cache=False, no_store=False, data_dir=PosixPath('/home/docs/soccerdata/data/WhoScored'), path_to_browser=None, headless=False)
Provides pd.DataFrames from data available at http://whoscored.com.
Data will be downloaded as necessary and cached locally in ~/soccerdata/data/WhoScored.
- Parameters:
leagues (string or iterable, optional) – IDs of Leagues to include.
seasons (string, int or list, optional) – Seasons to include. Supports multiple formats. Examples: ‘16-17’; 2016; ‘2016-17’; [14, 15, 16]
proxy ('tor' or dict or list(dict) or callable, optional) –
- Use a proxy to hide your IP address. Valid options are:
'tor': Uses the Tor network. Tor should be running in the background on port 9050.
dict: A dictionary with the proxy to use. The dict should be a mapping of supported protocols to proxy addresses. For example:
{ 'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080', }
list(dict): A list of proxies to choose from. A different proxy will be selected from this list after failed requests, allowing rotating proxies.
callable: A function that returns a valid proxy. This function will be called after failed requests, allowing rotating proxies.
no_cache (bool) – If True, will not use cached data.
no_store (bool) – If True, will not store downloaded data.
data_dir (Path) – Path to directory where data will be cached.
path_to_browser (Path, optional) – Path to the Chrome executable.
headless (bool, default: True) – If True, will run Chrome in headless mode. Setting this to False might help to avoid getting blocked. Only supported for Selenium <4.13.
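The callable form of the proxy option can be sketched as follows. This is a minimal illustration, not part of the library: make_proxy_rotator, PROXIES and build_scraper are hypothetical names, the second proxy address is made up, and the sketch assumes the callable is invoked without arguments and should return a dict in the mapping format shown above. The league ID passed to the constructor is also an assumption; check available_leagues() for the IDs your installation supports.

```python
import itertools

# Hypothetical proxy addresses, for illustration only.
PROXIES = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
]


def make_proxy_rotator(proxies):
    """Return a zero-argument callable that cycles through `proxies`."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)

    def next_proxy():
        # Each call hands out the next proxy mapping, wrapping around at the end.
        return next(pool)

    return next_proxy


def build_scraper():
    """Create a WhoScored reader that rotates proxies (assumed usage)."""
    # Imported lazily so the rotator can be used without soccerdata installed.
    import soccerdata as sd

    return sd.WhoScored(
        leagues="ENG-Premier League",  # assumed league ID
        seasons="16-17",
        proxy=make_proxy_rotator(PROXIES),
    )
```

Passing a list of dicts directly should give similar rotation behaviour; a callable is mainly useful when proxies come from an external pool.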
- read_schedule(force_cache=False)
Retrieve the game schedule for the selected leagues and seasons.
- Parameters:
force_cache (bool) – By default no cached data is used for the current season. If True, will force the use of cached data anyway.
- Return type:
pd.DataFrame
- read_missing_players(match_id=None, force_cache=False)
Retrieve a list of injured and suspended players ahead of each game.
- Parameters:
match_id (int or list of int, optional) – Retrieve the missing players for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
- Return type:
pd.DataFrame
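Because an unknown match_id raises ValueError, a caller that looks up games one at a time may want to treat that case as "no data". A minimal sketch, assuming scraper is an already constructed WhoScored instance; the helper name missing_players_or_none is hypothetical:

```python
def missing_players_or_none(scraper, match_id):
    """Return the missing-players table for one game, or None if the ID is unknown."""
    try:
        return scraper.read_missing_players(match_id=match_id)
    except ValueError:
        # The match_id was not found in the leagues and seasons the scraper was built with.
        return None
```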
- read_events(match_id=None, force_cache=False, live=False, output_fmt='events', retry_missing=True, on_error='raise')
Retrieve the event data for each game in the selected leagues and seasons.
- Parameters:
match_id (int or list of int, optional) – Retrieve the event stream for a specific game.
force_cache (bool) – By default no cached data is used to scrape the list of available games for the current season. If True, will force the use of cached data anyway.
live (bool) – If True, will not return a cached copy of the event data. This is useful for scraping live data.
output_fmt (str, default: 'events') –
- The output format of the returned data. Possible values are:
'events' (default): Returns a dataframe with all events.
'raw': Returns the original unformatted WhoScored JSON.
'spadl': Returns a dataframe with the SPADL representation of the original events. See https://socceraction.readthedocs.io/en/latest/documentation/SPADL.html#spadl
'atomic-spadl': Returns a dataframe with the Atomic-SPADL representation of the original events. See https://socceraction.readthedocs.io/en/latest/documentation/SPADL.html#atomic-spadl
’loader’: Returns a socceraction.data.opta.OptaLoader instance, which can be used to retrieve the actual data. See https://socceraction.readthedocs.io/en/latest/modules/generated/socceraction.data.opta.OptaLoader.html#socceraction.data.opta.OptaLoader
None: Doesn’t return any data. This is useful to just cache the data without storing the events in memory.
retry_missing (bool) – If True, retries scraping the events for games where a previous attempt found none.
on_error ("raise" or "skip", default: "raise") – Whether to raise an exception or to skip the game if an error occurs.
- Raises:
ValueError – If the given match_id could not be found in the selected seasons.
ConnectionError – If the match page could not be retrieved.
ImportError – If the requested output format is ‘spadl’, ‘atomic-spadl’ or ‘loader’ but the socceraction package is not installed.
- Return type:
See the description of the output_fmt parameter.
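Since the 'spadl', 'atomic-spadl' and 'loader' formats raise ImportError when socceraction is not installed, one option is to fall back to the default format. A minimal sketch, assuming scraper is an already constructed WhoScored instance; the helper name read_events_prefer_spadl is hypothetical:

```python
def read_events_prefer_spadl(scraper, match_id=None):
    """Return (data, fmt): SPADL events if available, plain events otherwise."""
    try:
        data = scraper.read_events(
            match_id=match_id, output_fmt="spadl", on_error="skip"
        )
        return data, "spadl"
    except ImportError:
        # socceraction is not installed; use the default event format instead.
        data = scraper.read_events(
            match_id=match_id, output_fmt="events", on_error="skip"
        )
        return data, "events"
```

Returning the format alongside the data lets the caller know which column layout to expect. Using on_error="skip" here is a choice made for the sketch; keep the default "raise" if a failed game should stop the run.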
- classmethod available_leagues()
Return a list of league IDs available for this source.
- Return type:
list[str]
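The returned list can be used to check league IDs before constructing a reader. A minimal sketch; the helper name unknown_leagues is hypothetical, and the league ID in the usage comment is an assumed example rather than a value taken from this page:

```python
def unknown_leagues(requested, available):
    """Return the requested league IDs that are not in `available`, in input order."""
    available_set = set(available)
    return [league for league in requested if league not in available_set]


# Assumed usage with the class method documented above:
#   import soccerdata as sd
#   bad = unknown_leagues(["ENG-Premier League"], sd.WhoScored.available_leagues())
#   if bad:
#       raise ValueError(f"Unsupported league IDs: {bad}")
```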