train_test_dataset

cosmo_utils.ml.ml_utils.train_test_dataset(pred_arr, feat_arr, pre_opt='min_max', shuffle_opt=True, random_state=0, test_size=0.25, reshape=False, return_idx=False)

Creates the training and testing datasets from a given feature array and predicted array.

Parameters:

pred_arr : pandas.DataFrame, numpy.ndarray or array-like, shape (n_samples, n_outcomes)

Array consisting of the predicted values. The dimensions of pred_arr are n_samples by n_outcomes, where n_samples is the number of observations, and n_outcomes the number of predicted outcomes.

feat_arr : numpy.ndarray, pandas.DataFrame or array-like, shape (n_samples, n_features)

Array consisting of the feature values. The dimensions of feat_arr are n_samples by n_features, where n_samples is the number of observations, and n_features the number of features used.

pre_opt : {‘min_max’, ‘standard’, ‘normalize’, ‘no’} str, optional

Type of preprocessing to do on feat_arr.

Options:
  • ‘min_max’ : Rescales feat_arr to values between 0 and 1
  • ‘standard’ : Uses sklearn.preprocessing.StandardScaler method
  • ‘normalize’ : Uses the sklearn.preprocessing.Normalizer method
  • ‘no’ : No preprocessing on feat_arr
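The four options can be illustrated with a minimal sketch. The helper name preprocess_features is hypothetical, and the use of sklearn.preprocessing.MinMaxScaler for ‘min_max’ is an assumption; the source only names the scalers for ‘standard’ and ‘normalize’. This is not the library's implementation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler


def preprocess_features(feat_arr, pre_opt="min_max"):
    """Hypothetical helper that mirrors the four `pre_opt` choices."""
    feat_arr = np.asarray(feat_arr, dtype=float)
    if pre_opt == "no":
        # No preprocessing: return the features unchanged.
        return feat_arr
    scalers = {
        # Assumption: 'min_max' maps each column onto the range 0 to 1.
        "min_max": MinMaxScaler(feature_range=(0, 1)),
        # Zero mean and unit variance per column.
        "standard": StandardScaler(),
        # Unit (L2) norm per row, i.e. per sample.
        "normalize": Normalizer(),
    }
    if pre_opt not in scalers:
        raise ValueError(f"Unknown pre_opt: {pre_opt!r}")
    return scalers[pre_opt].fit_transform(feat_arr)


feat_arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
scaled = preprocess_features(feat_arr, pre_opt="min_max")
```

Note that ‘min_max’ and ‘standard’ act on each feature column, whereas Normalizer rescales each sample (row) to unit norm.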

shuffle_opt : bool, optional

If True, the data is shuffled before splitting into testing and training datasets. This variable is set to True by default.

random_state : int, optional

Seed for the random number generator used when splitting into training and testing datasets. With a fixed value, the same split is produced on every call. This variable is set to 0 by default.

test_size : float, optional

Fraction of the catalogue assigned to the testing dataset. This variable must be between 0 and 1. This variable is set to 0.25 by default.
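The effect of shuffle_opt, random_state and test_size can be sketched with sklearn.model_selection.train_test_split, which accepts analogous arguments. Whether train_test_dataset calls it internally is an assumption, and the toy arrays below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, 1 outcome (illustrative values only).
feat_arr = np.arange(40, dtype=float).reshape(20, 2)
pred_arr = np.arange(20, dtype=float).reshape(20, 1)

# Shuffle, then hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    feat_arr, pred_arr, test_size=0.25, shuffle=True, random_state=0
)

# The same seed reproduces the same split.
X_train_again, X_test_again, _, _ = train_test_split(
    feat_arr, pred_arr, test_size=0.25, shuffle=True, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 20 samples and test_size=0.25, 5 samples go to the testing set and 15 to the training set, and each feature row stays paired with its outcome row.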

reshape : bool, optional

If True, it reshapes feat_arr into a 1d array if its shape is equal to (ncols, 1), where ncols is the number of columns. This variable is set to False by default.
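A minimal NumPy sketch of this kind of reshaping follows. The helper name maybe_flatten is hypothetical, and the exact condition the library checks is an assumption based on the description above.

```python
import numpy as np


def maybe_flatten(feat_arr, reshape=False):
    """Hypothetical helper: flatten an (n, 1) array to shape (n,) on request."""
    feat_arr = np.asarray(feat_arr)
    if reshape and feat_arr.ndim == 2 and feat_arr.shape[1] == 1:
        # A single-column 2d array becomes a 1d array.
        return feat_arr.ravel()
    # Any other shape is returned unchanged.
    return feat_arr


single_col = np.array([[0.1], [0.2], [0.3], [0.4]])
flat = maybe_flatten(single_col, reshape=True)
untouched = maybe_flatten(single_col, reshape=False)
```

Arrays with more than one column are left as they are even when reshape is True.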

return_idx : bool, optional

If True, it returns the indices of the training and testing datasets. This variable is set to False by default.
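One way to recover such indices, shown here as a sketch rather than the library's method, is to split an index array together with the data. The use of sklearn.model_selection.train_test_split and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature array: 8 samples, 3 features (illustrative values only).
feat_arr = np.arange(24, dtype=float).reshape(8, 3)

# Row positions in the original array, split alongside the features.
idx = np.arange(len(feat_arr))
X_train, X_test, idx_train, idx_test = train_test_split(
    feat_arr, idx, test_size=0.25, shuffle=True, random_state=0
)

# The indices map each split row back to its row in the original array.
recovered_train = feat_arr[idx_train]
recovered_test = feat_arr[idx_test]
```

The indices are useful for tracing a training or testing sample back to its entry in the original catalogue.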

Returns:

train_dict : dict

Dictionary containing the training data from the catalogue.

test_dict : dict

Dictionary containing the testing data from the catalogue.

See also

data_preprocessing
Function to preprocess a dataset.