train_test_dataset
cosmo_utils.ml.ml_utils.train_test_dataset(pred_arr, feat_arr, pre_opt='min_max', shuffle_opt=True, random_state=0, test_size=0.25, reshape=False, return_idx=False)
Function to create the training and testing datasets for a given feature array and array of predicted values.
Parameters:
pred_arr : pandas.DataFrame, numpy.ndarray or array-like, shape (n_samples, n_outcomes)
    Array consisting of the predicted values. The dimensions of pred_arr are n_samples by n_outcomes, where n_samples is the number of observations and n_outcomes is the number of predicted outcomes.
feat_arr : numpy.ndarray, pandas.DataFrame or array-like, shape (n_samples, n_features)
    Array consisting of the feature values. The dimensions of feat_arr are n_samples by n_features, where n_samples is the number of observations and n_features is the number of features used.
pre_opt : {‘min_max’, ‘standard’, ‘normalize’, ‘no’}, str, optional
    Type of preprocessing to do on feat_arr.
    Options:
    - ‘min_max’ : Rescales feat_arr to values between 0 and 1
    - ‘standard’ : Uses the sklearn.preprocessing.StandardScaler method
    - ‘normalize’ : Uses the sklearn.preprocessing.Normalizer method
    - ‘no’ : No preprocessing on feat_arr
shuffle_opt : bool, optional
    If True, the data is shuffled before splitting into testing and training datasets. This variable is set to True by default.
random_state : int, optional
    Random seed used when splitting into training and testing datasets. If set, the split always uses the same seed, random_state. This variable is set to 0 by default.
test_size : float, optional
    Fraction of the catalogue used for the testing dataset. This variable must be between (0,1). This variable is set to 0.25 by default.
reshape : bool, optional
    If True, it reshapes feat_arr into a 1d array if its shape is equal to (ncols, 1), where ncols is the number of columns. This variable is set to False by default.
return_idx : bool, optional
Returns:
train_dict : dict
    Dictionary containing the training data from the catalogue.
test_dict : dict
    Dictionary containing the testing data from the catalogue.
See also
data_preprocessing
    Function to preprocess a dataset.
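The preprocessing options and the split parameters described above can be illustrated with a minimal NumPy sketch. This is not the code of train_test_dataset: the helper names preprocess_features and split_indices are hypothetical, the ‘standard’ and ‘normalize’ branches are plain NumPy approximations of what StandardScaler and Normalizer compute by default, and the sketch returns index arrays rather than the dictionaries the real function returns, since their keys are not documented here.

```python
import numpy as np


def preprocess_features(feat_arr, pre_opt="min_max"):
    """Hypothetical stand-in for the `pre_opt` options (not the library's code).

    Expects a 2D array of shape (n_samples, n_features).
    """
    feat_arr = np.asarray(feat_arr, dtype=float)
    if pre_opt == "min_max":
        # Rescale every column to the range [0, 1].
        col_min = feat_arr.min(axis=0)
        col_range = feat_arr.max(axis=0) - col_min
        col_range = np.where(col_range == 0, 1.0, col_range)  # constant columns
        return (feat_arr - col_min) / col_range
    if pre_opt == "standard":
        # Zero mean and unit variance per column.
        col_std = feat_arr.std(axis=0)
        col_std = np.where(col_std == 0, 1.0, col_std)
        return (feat_arr - feat_arr.mean(axis=0)) / col_std
    if pre_opt == "normalize":
        # Unit L2 norm per row (one row per sample).
        row_norm = np.linalg.norm(feat_arr, axis=1, keepdims=True)
        row_norm = np.where(row_norm == 0, 1.0, row_norm)
        return feat_arr / row_norm
    if pre_opt == "no":
        return feat_arr
    raise ValueError(f"Unknown pre_opt: {pre_opt!r}")


def split_indices(n_samples, shuffle_opt=True, random_state=0, test_size=0.25):
    """Hypothetical helper: return (train_idx, test_idx) for n_samples rows."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    # Assumption: the test set size is rounded up to a whole number of rows.
    n_test = int(np.ceil(test_size * n_samples))
    idx = np.arange(n_samples)
    if shuffle_opt:
        # A fixed seed gives the same permutation on every call.
        idx = np.random.RandomState(random_state).permutation(n_samples)
    n_train = n_samples - n_test
    return idx[:n_train], idx[n_train:]


# Toy data: 8 observations, 2 features, 1 predicted outcome.
feat_arr = np.arange(16, dtype=float).reshape(8, 2)
pred_arr = np.arange(8, dtype=float).reshape(8, 1)

feat_scaled = preprocess_features(feat_arr, pre_opt="min_max")
train_idx, test_idx = split_indices(len(feat_arr), random_state=0, test_size=0.25)

feat_train, feat_test = feat_scaled[train_idx], feat_scaled[test_idx]
pred_train, pred_test = pred_arr[train_idx], pred_arr[test_idx]
print(feat_train.shape, feat_test.shape)  # (6, 2) (2, 2)
```

With test_size=0.25 and 8 observations, 2 rows go to the testing set and 6 to the training set. Calling split_indices again with the same random_state reproduces the same split, which is the behaviour the random_state description refers to.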