lalegpl.datasets.multitable.fetch_datasets module

lalegpl.datasets.multitable.fetch_datasets.fetch_imdb_dataset(datatype='pandas')

Fetches the IMDb movie dataset from the Relational Dataset Repository. The dataset contains information about the directors, actors, roles and genres of multiple movies in the form of 7 CSV files. This method downloads the 7 CSV files and stores them under the ‘lale/lale/datasets/multitable/imdb_data’ directory, creating the directory if it does not exist.

Dataset URL: https://relational.fit.cvut.cz/dataset/IMDb

Parameters

datatype (string, optional, default 'pandas') –

If ‘pandas’, returns a list of singleton dictionaries, one per table in the dataset, after reading the downloaded or existing CSV files. The key of each dictionary is the name of the table and the value is a pandas dataframe containing the table's data.

If ‘spark’, returns the same list of singleton dictionaries, except that each value is a spark dataframe instead of a pandas dataframe.

Any other value raises an error, because no other return type is supported.
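The accepted values of datatype can be illustrated with a minimal sketch. The helper name check_datatype and the use of ValueError are assumptions made for illustration; this page does not state which exception the library actually raises.

```python
# Hypothetical helper, not part of lalegpl: mirrors the documented rule that
# only 'pandas' and 'spark' are accepted values for `datatype`.
SUPPORTED_DATATYPES = ("pandas", "spark")


def check_datatype(datatype="pandas"):
    """Return `datatype` unchanged if supported, otherwise raise an error."""
    if datatype not in SUPPORTED_DATATYPES:
        # ValueError is an assumption; the real exception type may differ.
        raise ValueError(
            f"Unsupported datatype {datatype!r}; "
            f"expected one of {SUPPORTED_DATATYPES}"
        )
    return datatype


default_choice = check_datatype()        # falls back to 'pandas'
spark_choice = check_datatype("spark")   # explicitly requested
```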

Returns

imdb_list

Return type

list of singleton dictionaries of pandas / spark dataframes
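Because each list element is a dictionary with exactly one entry, it is often convenient to merge the list into a single dictionary keyed by table name. The sketch below is one possible way to do that; tables_by_name is a hypothetical helper, and the table names and string values in the sample are stand-ins for the dataframes that fetch_imdb_dataset would return.

```python
# Hypothetical helper, not part of lalegpl: merges a list of singleton
# dictionaries ({table_name: dataframe}) into one dictionary.
def tables_by_name(table_list):
    merged = {}
    for entry in table_list:
        if len(entry) != 1:
            raise ValueError("expected a dictionary with exactly one table")
        # Unpack the single (name, table) pair of this dictionary.
        ((name, table),) = entry.items()
        if name in merged:
            raise ValueError(f"duplicate table name: {name!r}")
        merged[name] = table
    return merged


# Stand-in data with illustrative table names. With the library installed,
# the list would instead come from:
#   from lalegpl.datasets.multitable.fetch_datasets import fetch_imdb_dataset
#   imdb_list = fetch_imdb_dataset(datatype="pandas")
sample_list = [{"movies": "movies dataframe"}, {"actors": "actors dataframe"}]
tables = tables_by_name(sample_list)
```

A table can then be looked up directly by name, for example tables["movies"], instead of scanning the list for the dictionary that holds it.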

lalegpl.datasets.multitable.fetch_datasets.get_data_from_csv(datatype, data_file_name)