train_test_dataset
cosmo_utils.ml.ml_utils.train_test_dataset(pred_arr, feat_arr, pre_opt='min_max', shuffle_opt=True, random_state=0, test_size=0.25, reshape=False, return_idx=False)
Function to create the training and testing datasets from a given array of features and a given array of predicted values.
Parameters:
pred_arr : pandas.DataFrame, numpy.ndarray or array-like, shape (n_samples, n_outcomes)
Array consisting of the predicted values. The dimensions of pred_arr are n_samples by n_outcomes, where n_samples is the number of observations and n_outcomes is the number of predicted outcomes.
feat_arr : numpy.ndarray, pandas.DataFrame or array-like, shape (n_samples, n_features)
Array consisting of the feature values. The dimensions of feat_arr are n_samples by n_features, where n_samples is the number of observations and n_features is the number of features used.
pre_opt : {‘min_max’, ‘standard’, ‘normalize’, ‘no’} str, optional
Type of preprocessing to apply to feat_arr. This variable is set to ‘min_max’ by default.
Options:
- ‘min_max’ : Rescales feat_arr to values between (0,1)
- ‘standard’ : Uses the sklearn.preprocessing.StandardScaler method
- ‘normalize’ : Uses the sklearn.preprocessing.Normalizer method
- ‘no’ : No preprocessing on feat_arr
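As an illustration of the ‘min_max’ option, column-wise rescaling to the range 0 to 1 can be sketched in plain Python. This is only a minimal sketch and not the library's implementation: the helper name min_max_scale is hypothetical, and mapping a constant column to 0.0 is an assumption made here for illustration.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` (a list of equal-length lists) to [0, 1].

    Hypothetical helper for illustration; not part of cosmo_utils.
    """
    scaled_columns = []
    for column in zip(*rows):
        lowest, highest = min(column), max(column)
        spread = highest - lowest
        # Assumption: a constant column has no spread, so map it to 0.0
        # rather than divide by zero.
        scaled_columns.append(
            [(value - lowest) / spread if spread else 0.0 for value in column]
        )
    # Transpose back so that each entry is one observation again.
    return [list(row) for row in zip(*scaled_columns)]


# Three observations with two features each.
feat_arr = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
scaled = min_max_scale(feat_arr)
```

Each feature is rescaled independently, so features measured in very different units end up on a comparable scale.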
shuffle_opt : bool, optional
If True, the data is shuffled before splitting into testing and training datasets. This variable is set to True by default.
random_state : int, optional
Random state number used when splitting into training and testing datasets. If set, the split always uses the same seed, random_state. This variable is set to 0 by default.
test_size : float, optional
Fraction of the catalogue used for the testing dataset. This variable must be between (0,1). This variable is set to 0.25 by default.
reshape : bool, optional
If True, it reshapes feat_arr into a 1d array if its shape is equal to (ncols, 1), where ncols is the number of columns. This variable is set to False by default.
return_idx : bool, optional
Returns:
train_dict : dict
Dictionary containing the training data from the catalogue.
test_dict : dict
Dictionary containing the testing data from the catalogue.
See also
data_preprocessing - Function to preprocess a dataset.
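The way shuffle_opt, random_state and test_size interact can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the cosmo_utils implementation: the function name shuffled_split and the dictionary keys "features", "predicted" and "idx" are hypothetical, and rounding the number of test samples up is a choice made here for illustration.

```python
import math
import random


def shuffled_split(pred_arr, feat_arr, test_size=0.25, shuffle_opt=True, random_state=0):
    """Split paired samples into training and testing dictionaries.

    Hypothetical helper for illustration; the dictionary keys are assumptions
    and do not reflect the keys used by cosmo_utils.
    """
    n_samples = len(feat_arr)
    if len(pred_arr) != n_samples:
        raise ValueError("pred_arr and feat_arr must have the same number of samples")
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must lie strictly between 0 and 1")

    indices = list(range(n_samples))
    if shuffle_opt:
        # A dedicated generator seeded with random_state makes the shuffle repeatable.
        random.Random(random_state).shuffle(indices)

    # Assumption: round the number of testing samples up.
    n_test = math.ceil(test_size * n_samples)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        # Keep features and predicted values aligned by selecting with the same indices.
        return {
            "features": [feat_arr[i] for i in idx],
            "predicted": [pred_arr[i] for i in idx],
            "idx": idx,
        }

    return pick(train_idx), pick(test_idx)


# Eight observations, one feature each; the predicted value is 10 times the feature.
feat_arr = [[float(i)] for i in range(8)]
pred_arr = [10.0 * i for i in range(8)]
train_dict, test_dict = shuffled_split(pred_arr, feat_arr)
```

With eight observations and test_size=0.25, two samples go to the testing set and six to the training set. Calling the helper again with the same random_state reproduces the same split.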