RAISING demonstration¶
RAISING is a neural network (NN) framework implemented in a two-stage approach: it first performs hyperparameter tuning to devise the best NN architecture, and then trains on the data to estimate feature importance. Here we demonstrate the functionality of RAISING on simulated data, introducing the various hyperparameter tuning and feature selection methods.
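As orientation, the outline below sketches how the two stages map onto the two RAISING functions used throughout this tutorial, hp_optimization() for architecture search and feature_importance() for feature selection; the concrete calls with real arguments follow in the sections below.
# Outline of the two-stage RAISING workflow (concrete, runnable calls appear later in this tutorial):
#   Stage 1 -- hyperparameter tuning: hp_optimization() devises the NN architecture and saves it to a .keras file.
#   Stage 2 -- feature selection: feature_importance() trains the saved architecture and estimates per-feature importance.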
Import required libraries¶
import pandas as pd
from RAISING.hp_tune_select import *
Load simulated data¶
We will use the simulated data from Capblancq et al. (2018) stored in sim1.csv. There are 64 populations, and the authors sampled 10 individuals from each population at the end of the simulations. The input data consist of 640 individuals and 1000 loci, and the output data consist of 640 individuals and 10 environmental factors.
df = pd.read_csv("sim1.csv")
We will store the input genotype matrix in the X_data variable and the environmental matrix in y_data.
X_data = df.iloc[:,14:]
X_data.columns = ["X" + str(i) for i in range(1, X_data.shape[1]+1)]
y_data = df.iloc[:,1:11]
print(X_data.head())
print(y_data.head())
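As a quick sanity check, we can confirm that the subsets match the dimensions described above (640 individuals with 1000 loci as input and 10 environmental factors as output):
print(X_data.shape)  # expected (640, 1000): individuals x loci
print(y_data.shape)  # expected (640, 10): individuals x environmental factors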
Implementation of hyperparameter tuning methods¶
We will use the hp_optimization() function defined in RAISING for hyperparameter tuning. The description of the function can be found in the README file README_RAISING.html. We will run several hyperparameter tuning methods and then perform feature selection using the methods described in the documentation.
Since the supervised learning task is regression, the default loss function is mse and the metric is R-squared. In this example we have chosen the validation loss (val_loss) as the objective function to be minimised. These are Keras/TensorFlow parameters that can be changed if needed.
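For illustration only, the snippet below shows the plain Keras equivalents of these defaults (mean squared error loss and val_loss as the monitored quantity); it is not part of RAISING, which builds and compiles its models internally.
import tensorflow as tf

# Purely illustrative Keras equivalents of the defaults described above (not RAISING code).
illustrative_model = tf.keras.Sequential([tf.keras.layers.Dense(y_data.shape[1])])
illustrative_model.compile(optimizer="adam", loss="mse")  # mse is the default regression loss
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss")  # val_loss is the minimised objective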
1) Hyperparameter tuning using Bayesian optimization¶
Bayesian_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                      algorithm="Bayesian", model_file="Bayesian_NN_architecture_HP.keras", max_trials=30)
print(Bayesian_tune_model.summary())
2) Hyperparameter tuning using RandomSearch¶
RS_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                algorithm="RandomSearch", model_file="RS_NN_architecture_HP.keras", max_trials=50)
print(RS_tune_model.summary())
3) Hyperparameter tuning using RSLM¶
RSLM_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                  algorithm="RSLM", model_file="RSLM_NN_architecture_HP.keras", max_trials=50)
print(RSLM_tune_model.summary())
4) Hyperparameter tuning using Hyperband¶
Hyperband_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                       algorithm="Hyperband", model_file="Hyperband_NN_architecture_HP.keras", max_epochs=20)
print(Hyperband_tune_model.summary())
5) Hyperparameter tuning using Bayesian optimization with cross-validation¶
In the previous examples, we performed hyperparameter tuning based on a train-test-split strategy.
By specifying the parameter cross_validation = True, we can perform K-fold cross-validation instead. By default, 3-fold cross-validation (n_splits = 3) is performed; a short illustration of what such a split looks like follows the call below.
Bayesian_CV_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                         algorithm="Bayesian", model_file="Bayesian_CV_NN_architecture_HP.keras",
                                         cross_validation=True, max_trials=20)
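For intuition, the sketch below uses scikit-learn's KFold purely to illustrate how a 3-fold split partitions the 640 individuals; RAISING performs the cross-validation internally when cross_validation = True.
from sklearn.model_selection import KFold

# Illustration only: how a 3-fold split partitions the samples.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_data), start=1):
    print(f"Fold {fold}: {len(train_idx)} training and {len(val_idx)} validation individuals")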
6) Hyperparameter tuning using Bayesian optimization with user-defined hyperparameter space¶
We will use the hyperparameter_config.json file for this example. Here we only perform L1 regularization. A hypothetical sketch of what such a configuration file might contain is shown after the call below.
Bayesian_US_tune_model = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="continuous",
                                         algorithm="Bayesian", model_file="Bayesian_US_NN_architecture_HP.keras", max_trials=20,
                                         config_file="hyperparameter_config.json")
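The exact schema accepted by config_file depends on the RAISING version (see README_RAISING.html); the snippet below writes a purely hypothetical example of the kind of search space such a file might encode, with an L1 regularization penalty as used here. The keys and the example file name are assumptions, not the documented format.
import json

# Hypothetical search-space definition; the keys below are illustrative only and
# may differ from the format RAISING actually expects (see README_RAISING.html).
example_space = {
    "num_layers": {"min_value": 1, "max_value": 3},
    "units": {"min_value": 32, "max_value": 256, "step": 32},
    "activation": ["relu", "tanh"],
    "regularization": {"type": "l1", "min_value": 1e-5, "max_value": 1e-2},
}
with open("hyperparameter_config_example.json", "w") as fh:  # separate file name to avoid overwriting the real config
    json.dump(example_space, fh, indent=2)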
Creating a function to generate the plots of feature importance estimates¶
We will scale the feature importance of the genetic loci to [0, 1] for the linear environmental gradient (envir2) and generate a plot with the genetic locus position on the x-axis and the scaled feature importance for the linear environment on the y-axis. The loci at the positions listed in actual_pos are highlighted in red.
import matplotlib.pyplot as plt
import numpy as np
actual_pos = [101,111,121,131,141,151,161,171,181,191]
def minmaxscale(vec):
    # Scale a vector to the [0, 1] range
    return (vec - min(vec)) / (max(vec) - min(vec))

def plot_feature_importance(GenFeat_df, actual_pos):
    # Scale the feature importance for the linear environment (envir2) to [0, 1]
    GenFeat_df["scaled_feat"] = minmaxscale(vec=GenFeat_df["envir2"])
    GenFeat_df["loci_index"] = GenFeat_df.index + 1
    plt.figure(figsize=(10, 6))
    plt.plot(GenFeat_df['loci_index'], GenFeat_df['scaled_feat'], marker='o', linestyle='-', color='b')
    # Highlight the loci at the positions listed in actual_pos
    highlight_df = GenFeat_df[GenFeat_df['loci_index'].isin(actual_pos)]
    plt.scatter(highlight_df['loci_index'], highlight_df['scaled_feat'], color='red', zorder=5)
    plt.title('Scaled feature importance by locus for the linear environment')
    plt.xlabel('Loci Index')
    plt.ylabel('Scaled Feature Importance')
    plt.grid(True)
    plt.show()
Implementation of feature selection methods on the architecture obtained using Bayesian optimization¶
We will use the feature_importance() function defined in RAISING for feature selection. The description of the function can be found in the README file README_RAISING.html.
Here we will use the iteration parameter to control the number of model-training iterations on which the feature importance estimates are based. For the purpose of this tutorial, we restrict ourselves to the architecture obtained with the Bayesian method, Bayesian_NN_architecture_HP.keras.
1) DeepFeatImp method¶
Feature importance based on a single iteration of model training (iteration=1)
DeepFeatImp_df = feature_importance(input_data=X_data, output_data=y_data, feature_set=X_data.columns.to_list(), iteration=1,
                                    feature_method="DeepFeatImp", model_file="Bayesian_NN_architecture_HP.keras")
plot_feature_importance(GenFeat_df=DeepFeatImp_df.copy(),actual_pos = actual_pos)
Feature importance based on five iterations of model training (iteration=5)
DeepFeatImp_df = feature_importance(input_data=X_data, output_data=y_data, feature_set=X_data.columns.to_list(), iteration=5,
                                    feature_method="DeepFeatImp", model_file="Bayesian_NN_architecture_HP.keras")
plot_feature_importance(GenFeat_df=DeepFeatImp_df.copy(),actual_pos = actual_pos)
2) DeepExplainer method¶
Feature importance based on five iterations of model training (iteration=5)
DeepExplainer_df = feature_importance(input_data=X_data, output_data=y_data, feature_set=X_data.columns.to_list(), iteration=5,
                                      feature_method="DeepExplainer", model_file="Bayesian_NN_architecture_HP.keras")
plot_feature_importance(GenFeat_df=DeepExplainer_df.copy(),actual_pos = actual_pos)
3) KernelExplainer method¶
KernelExplainer_df = feature_importance(input_data=X_data, output_data=y_data, feature_set=X_data.columns.to_list(), iteration=1,
                                        feature_method="KernelExplainer", nsamples=10, model_file="Bayesian_NN_architecture_HP.keras")
plot_feature_importance(GenFeat_df=KernelExplainer_df.copy(),actual_pos = actual_pos)
Binary classification example with Bayesian optimization method and DeepFeatImp¶
We will use the breast cancer dataset from scikit-learn for the binary classification demonstration:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
from sklearn.datasets import load_breast_cancer
X_data, y_data = load_breast_cancer(return_X_y=True, as_frame=True)
y_data = pd.DataFrame(y_data)
Bayesian_tune_model_binary = hp_optimization(input_data=X_data, output_data=y_data, objective_fun="val_loss", output_class="binary",
                                             algorithm="Bayesian", model_file="Bayesian_binary_NN_architecture_HP.keras", max_trials=20)
print(Bayesian_tune_model_binary.summary())
DeepFeatImp_df_binary = feature_importance(input_data=X_data, output_data=y_data, feature_set=X_data.columns.to_list(), iteration=10,
                                           output_class="binary", feature_method="DeepFeatImp",
                                           model_file="Bayesian_binary_NN_architecture_HP.keras")
DeepFeatImp_df_binary["scaled_feat"] = minmaxscale(vec = DeepFeatImp_df_binary["target"])
DeepFeatImp_df_binary = DeepFeatImp_df_binary.sort_values(by='target', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(DeepFeatImp_df_binary['features'], DeepFeatImp_df_binary['target'], color='skyblue')
plt.xlabel('Feature Importance')
plt.gca().invert_yaxis() # Invert y-axis to have the highest importance at the top
plt.show()