pycol: Data Sources

Artificial Datasets

To explore the complexity measures implemented in pycol, the user may refer to the dataset folder in the GitHub repository.

Alternatively, it is also possible to generate custom artificial datasets using a data generator that outputs files in .arff format (documentation is available here).
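
For a quick inspection of a generated file, the .arff output can be loaded with scipy (the file name below is illustrative):

from scipy.io import arff
import pandas as pd

# Load a generated .arff file and inspect its attributes and samples
data, meta = arff.loadarff('generated_dataset.arff')
df = pd.DataFrame(data)
print(meta)       # attribute names and types
print(df.head())  # first few samples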

If the user wishes to select datasets with specific complexity characteristics, the pycol package also offers an extensive benchmark of previously computed complexity measures, available in this .csv file. The datasets used for this benchmark are also available in the dataset/alg_sel folder here.
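
As a minimal sketch of how such a benchmark could be used for dataset selection, the .csv file can be filtered with pandas. Note that the file name and column names below are illustrative assumptions, not the actual schema of the benchmark:

import pandas as pd

# Load the precomputed complexity benchmark
# (file name and column names are hypothetical)
benchmark = pd.read_csv('complexity_benchmark.csv')

# Keep only datasets above a chosen threshold for a given complexity measure
selected = benchmark[benchmark['F1'] > 0.5]
print(selected['dataset'].tolist())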

Benchmark of Imbalanced Datasets

To experiment with a large benchmark of imbalanced datasets, the user is referred to the KEEL Dataset Repository, which contains a selection of datasets categorized by imbalance ratio (IR).
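
Here, IR is commonly defined as the number of majority-class samples divided by the number of minority-class samples. A minimal sketch of computing it from a label vector:

import numpy as np

def imbalance_ratio(y):
    # Majority class size divided by minority class size
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
print(imbalance_ratio(y))  # 4.0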

Other Real-World Datasets

There are also several other publicly available data sources that can be used while exploring pycol.

Reading datasets into pycol

The first step when using pycol is to instantiate the Complexity class. When doing this, the user must provide the dataset to be analysed, the distance function used to compute distances between samples, and the file type. The example below showcases the analysis of the dataset 61_iris.arff, choosing the default distance function (HEOM, the Heterogeneous Euclidean-Overlap Metric, which handles both numerical and categorical features) and specifying that the dataset is in the arff format:

# The import path is assumed here and may differ across installations
from pycol.complexity import Complexity

complexity = Complexity('61_iris.arff',
                        distance_func='default',
                        file_type='arff')

Alternatively, a user might want to load a dataset directly into pycol from an array, for example after fetching a dataset from sklearn. To do this, the user must specify the file_type argument as "array" and provide a Python dictionary with the keys 'X', containing the data, and 'y', containing the target labels.

# Fetch a dataset from sklearn and pass it to pycol as an array
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X = dataset.data    # feature matrix
y = dataset.target  # target labels
dic = {'X': X, 'y': y}
complexity = Complexity(dataset=dic,
                        file_type="array")
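
The same pattern applies to any tabular source that can be converted to arrays. For instance, a CSV file could be loaded with pandas and passed in the same dictionary format (the file name and label column below are hypothetical):

import pandas as pd

# Hypothetical CSV where the 'class' column holds the target labels
df = pd.read_csv('my_dataset.csv')
X = df.drop(columns=['class']).to_numpy()
y = df['class'].to_numpy()

complexity = Complexity(dataset={'X': X, 'y': y},
                        file_type="array")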