Detecting Muon Momentum in the CMS Experiment at CERN using Deep Learning
Posted on Thu 19 November 2020 in posts • 12 min read
Introduction¶
In the following notebook, we will apply several Deep Learning approaches to properly predict muon momentum. We will use Monte-Carlo simulated data from the Cathode Strip Chambers (CSC) at the CMS experiment. These chambers detect muon particles in the outer layer of the CMS detector, allowing us to store information about the hit locations.
The dataset contains more than 3 million muon events generated using Pythia.
You can also check the article where we summarize our project and explain the code here.
Importing Libraries¶
First, let's import the libraries that we will need for this project:
- Numpy - for vector manipulation and operations
- Scikit-Learn - for building and training predictive models
- Matplotlib and Seaborn - for visualizing various types of plots
- Pandas - for data cleaning and analysis
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
np.random.seed(48) # set the random seed for reproducible results
For training Neural Networks, we will be using TensorFlow 2, a popular Deep Learning library for developing, training and deploying deep learning models.
Let's import Tensorflow and check its version:
import tensorflow as tf
print(tf.__version__)
Retrieving the Dataset¶
Our data is in NPZ format, so we will use numpy.load() to load the data file:
data=np.load("histos_tba.npz")
The data file contains two arrays: variables and parameters.
data["variables"].shape,data["parameters"].shape
We will extract the data from the variables array, keeping only the muon hits detected in the CSC chambers:
def delete_not_csc(data):
    muon_hits = data['variables']
    indices_to_del = []
    # Columns 0-83 come in groups of 12 values per feature; keep only the
    # first 5 entries (the CSC hits) of each group, plus all columns from 84 on
    for i in range(muon_hits.shape[1]):
        if i % 12 >= 5 and i < 84:
            indices_to_del.append(i)
    return np.delete(muon_hits, indices_to_del, axis=1)
muon_hits=delete_not_csc(data)
muon_hits=delete_not_csc(data)
muon_hits.shape
Next, let's prepare the column names for each feature and combine the hit variables and muon parameters into a Pandas DataFrame for easier analysis:
original_columns_names = ["Phi angle", "Theta angle", "Bend angle", "Time",
                          "Ring", "Front/Rear", "Mask"]
columns_names = []
for name in original_columns_names:
    for i in range(5):
        columns_names.append(name + str(i))
columns_names += ["XRoad0", "XRoad1", "XRoad2"]
muon_hits_df=pd.DataFrame(muon_hits,columns=columns_names)
muon_hits_df["q/pt"]=data["parameters"][:,0]
muon_hits_df["Phi_angle"]=data["parameters"][:,1]
muon_hits_df["Eta_angle"]=data["parameters"][:,2]
Exploratory Data Analysis (EDA)¶
Let's start our EDA by taking a look at the column types:
muon_hits_df.info()
We notice that all our variables have float data type, which will prove helpful later on when preprocessing the data.
Next, let's generate some useful statistics about our features:
muon_hits_df.describe()
We notice that some features such as Phi angle1, Theta angle1 and Bend angle1 have some missing values.
Let's find the exact percentage of null values for each feature:
null_perc = (muon_hits_df.isnull().sum() / muon_hits_df.shape[0] * 100
             ).sort_values(ascending=False)
fig = plt.figure(figsize = (10,5))
ax = fig.gca()
ax.set_xlabel("features names")
ax.set_ylabel("missing values percentage")
null_perc.plot.bar(ax=ax)
plt.show()
We notice that Front/Rear1, Phi angle1, Theta angle1, Bend angle1, Time1 and Ring1 features have more than 70% missing values!
Next, let's take a look at the data distribution of each feature:
fig = plt.figure(figsize = (30,40))
ax = fig.gca()
muon_hits_df.hist(ax=ax)
plt.show()
Wow, those are a LOT of plots! It is clear from these plots that some features such as Theta angle1, Theta angle2, Theta angle3 and XRoad1 are skewed and can benefit from being rescaled to zero mean and unit variance (standardization).
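To make the standardization step concrete, here is a minimal sketch on a hypothetical toy feature (not the actual dataset), using Scikit-Learn's StandardScaler, the same transformer we will use in the preprocessing pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed feature values (not taken from the dataset)
x = np.array([[1.0], [2.0], [2.0], [3.0], [50.0]])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)

# After standardization the column has (approximately) zero mean and unit variance
print(x_scaled.mean(), x_scaled.std())
```

Note that standardization only rescales a feature; a heavily skewed distribution stays skewed after it, so a log transform can be worth trying as well.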
Next, let's investigate the effect of each feature on the target feature q/pt:
corr=muon_hits_df.corr()
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
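As a complement to the heatmap, we can rank features by their (linear) effect on the target directly, by sorting the target column of the correlation matrix. A minimal sketch on a hypothetical toy frame; in the notebook the same idea would apply to corr["q/pt"]:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features and a target that depends on them
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=100)})
toy["b"] = 2 * toy["a"] + rng.normal(scale=0.1, size=100)
toy["target"] = toy["a"] + rng.normal(scale=0.5, size=100)

# Rank the features by absolute correlation with the target
ranking = toy.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```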
Framing the Problem as a Classification Task¶
Instead of applying regression to predict the muon momentum values (q/pt) directly, let's bin the momenta into 4 classes (0-10 GeV, 10-30 GeV, 30-100 GeV and >100 GeV) and frame the problem as a classification task.
Let's investigate the q/pt column and separate it from the data:
pt_inverse_labels=muon_hits_df["q/pt"]
muon_hits_df_x=muon_hits_df.drop("q/pt",axis=1)
muon_hits_df["q/pt"].describe()
We will use the absolute value of the reciprocal of the target feature (pt/q) to group the momenta into the previously mentioned classes:
pt_labels=abs((1/pt_inverse_labels))
pt_labels.describe()
Let's visualize the data distribution of this feature:
pt_labels.hist(bins=1000)
plt.xlim([0,150])
plt.title("Data Distribution of p_T labels")
plt.show()
That is a very interesting plot! We can clearly see that there are distinct groups of points of different magnitudes.
We can now group the values into 4 groups: 0-10 GeV, 10-30 GeV, 30-100 GeV and >100 GeV using the cut method from Pandas.
bins = pd.IntervalIndex.from_tuples([(0, 10), (10, 30), (30, 100),(100,pt_labels.max()+1)])
org_pt_labels_groups=pd.cut(pt_labels, bins)
Let's plot the grouped data now:
target_counts=org_pt_labels_groups.value_counts()
target_counts.plot.bar()
plt.title("Distribution of classes in target feature")
plt.show()
It is very clear that the classes are not balanced. It will therefore be important to balance them when we train our Neural Network, to avoid any bias.
Splitting the Data¶
Next, let's separate the data into train and test sets; we will use a 90%-10% split.
As we already noticed, the data is highly imbalanced, therefore it is important to perform the following two tasks:
- Add class weights to balance out the classes
- Make the test set representative of the classes
To split the train and test data, we will use the StratifiedShuffleSplit method in Scikit-Learn to make both the train data and test data representative of the target classes distribution:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_index, test_index in split.split(muon_hits_df_x, org_pt_labels_groups):
    muon_hits_x_train_set = muon_hits_df_x.loc[train_index]
    pt_labels_groups_y_train_set = org_pt_labels_groups[train_index]
    muon_hits_x_test_set = muon_hits_df_x.loc[test_index]
    pt_labels_groups_y_test_set = org_pt_labels_groups[test_index]
Let's verify that the target label has the same class proportions in the original, train and test sets:
def target_proportions(data):
    return data.value_counts() / len(data) * 100
proportions_df=pd.DataFrame({"Original": target_proportions(org_pt_labels_groups),
"Train":target_proportions(pt_labels_groups_y_train_set),
"Test":target_proportions(pt_labels_groups_y_test_set)},
)
proportions_df
Neat! Now that we are sure that the train and test data are representative of the original dataset, we can proceed to the preprocessing phase.
Preparing data for training¶
As we noticed earlier, we need to delete the columns with more than 70% missing values:
def delete_columns(df, perc):
    null_perc = (df.isnull().sum() / df.shape[0]) * 100
    col_to_del = [col for col in df.columns if null_perc[col] > perc]
    print("columns deleted:", col_to_del)
    return df.drop(col_to_del, axis=1)
muon_hits_x_train_set_1=delete_columns(muon_hits_x_train_set,70)
muon_hits_x_train_set_1.shape
For the remaining features, with less than 70% null values, we will replace the null values with the mean of their corresponding columns:
def replace_missing_with_mean(df):
    from sklearn.impute import SimpleImputer
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputed = imputer.fit_transform(df)
    return pd.DataFrame(imputed, columns=df.columns)
muon_hits_x_train_set_2=replace_missing_with_mean(muon_hits_x_train_set_1)
null_perc_2 = (muon_hits_x_train_set_2.isnull().sum() /
               muon_hits_x_train_set_2.shape[0]) * 100
null_perc_2.sort_values(ascending=False)
Now we are sure that we filled all missing values.
Let's next create the preprocessing pipeline and convert the target feature into one hot encoded variables.
def preprocess_pipeline(df_x, df_y):
    # delete columns with more than 70% missing values
    df_x = delete_columns(df_x, 70)
    # impute the remaining missing values with the column means
    df_x = replace_missing_with_mean(df_x)
    # standardize the data
    # (note: for simplicity this pipeline fits the scaler and encoders on the
    # data it receives; strictly, they should be fit on the train set only and
    # reused to transform the test set)
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    x = scaler.fit_transform(df_x)
    # convert labels into dummy variables (one hot encoded)
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    encoded_y = encoder.fit_transform(df_y)
    print(encoder.classes_)
    from sklearn.preprocessing import OneHotEncoder
    ohe = OneHotEncoder(sparse=False)
    ohe_y = ohe.fit_transform(encoded_y.reshape(-1, 1))
    print(ohe.categories_)
    return x, ohe_y, encoded_y
muon_hits_x_train_values, pt_labels_ohe_y_train, y_encoded_train = preprocess_pipeline(
    muon_hits_x_train_set, pt_labels_groups_y_train_set)
muon_hits_x_test_values, pt_labels_ohe_y_test, y_encoded_test = preprocess_pipeline(
    muon_hits_x_test_set, pt_labels_groups_y_test_set)
Now our datasets are ready for training and testing!
One last task before training: generating the class weights.
# option 1: let Scikit-Learn compute the 'balanced' weights
from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(y_encoded_train),
                                                  y=y_encoded_train)
class_weights
# option 2: the same formula by hand, n_samples / (n_classes * count_per_class)
class_weights = len(y_encoded_train) / (4 * np.unique(y_encoded_train, return_counts=True)[1])
classes_weights = {i: class_weights[i] for i in range(len(class_weights))}
classes_weights
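Both options implement the same "balanced" heuristic: a class with n_c samples gets weight n_samples / (n_classes * n_c), so rarer classes get larger weights. A quick sanity check on a hypothetical set of labels:

```python
import numpy as np
from sklearn.utils import class_weight

# Hypothetical imbalanced labels: class 0 appears 8 times, class 1 twice
y = np.array([0] * 8 + [1] * 2)

# Scikit-Learn's 'balanced' heuristic
w_sklearn = class_weight.compute_class_weight(class_weight='balanced',
                                              classes=np.unique(y), y=y)

# The same formula by hand: n_samples / (n_classes * count_per_class)
counts = np.unique(y, return_counts=True)[1]
w_manual = len(y) / (len(counts) * counts)

print(w_sklearn, w_manual)  # both give [0.625, 2.5]
```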
Building the Neural Network¶
We will use TensorFlow's Keras API to build the Neural Network.
Building a Fully Connected Neural Network¶
Let's first build and compile the model:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Dropout
model=Sequential()
model.add(Dense(512,input_dim=(muon_hits_x_train_values.shape[1]),activation='relu'))
model.add(Dense(256,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(128,activation='relu'))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(4,activation='softmax'))
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
Let's take a closer look at the model architecture:
model.summary()
Next, we will set some callbacks for the model, including a Checkpoint callback and an Early Stopping callback that both monitor the model's validation accuracy:
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
filepath="classifier_weights2-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint1 = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
checkpoint2= EarlyStopping( monitor='val_accuracy', patience=3)
callbacks_list = [checkpoint1, checkpoint2]
Now we can finally start training!
We will use a 25% validation split and a batch size of 500, and train for up to 200 epochs (Early Stopping will usually end training sooner).
history = model.fit(muon_hits_x_train_values, pt_labels_ohe_y_train,
validation_split=0.25,
epochs=200,
batch_size=500,
callbacks=callbacks_list,
class_weight=classes_weights)
Next, let's plot the evolution of the metrics over the epochs of training:
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("accuracy2.png")
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.title('model train loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.savefig("loss2-1.png")
plt.show()
plt.plot(history.history['val_loss'])
plt.title('model validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.savefig("loss2-2.png")
plt.show()
Let's load the model with the best performance:
model.load_weights("/content/classifier_weights-improvement-94-0.84.hdf5")
Testing the Model¶
Let's generate a classification report that will give us insight into the model's performance on both the train set and the test set:
from sklearn.metrics import classification_report
y_train_pred=model.predict(muon_hits_x_train_values)
y_train_pred_best=np.argmax(y_train_pred,axis=1)
train_report=classification_report(y_encoded_train,y_train_pred_best)
y_test_pred=model.predict(muon_hits_x_test_values)
y_test_pred_best=np.argmax(y_test_pred,axis=1)
test_report=classification_report(y_encoded_test,y_test_pred_best)
print("Train report \n",train_report)
print("Test report \n",test_report)
That's impressive! It seems that adding class weights definitely reduced bias and we reached a weighted f1-score of 0.85 on the test data!
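The classification report aggregates per-class precision and recall; a confusion matrix makes any remaining confusion between momentum classes explicit. A minimal sketch on hypothetical predictions (in the notebook you would pass y_encoded_test and y_test_pred_best instead):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical encoded labels and predictions for the 4 momentum classes
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 0])

# Row i, column j counts samples of class i predicted as class j, so
# off-diagonal mass in a minority-class row exposes where the model is biased
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```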
Experimenting with Regression¶
Instead of classifying into four groups, let's try to directly predict the q/pT value for each muon:
Splitting the Data¶
Let us first split the dataset into train and test data:
from sklearn.model_selection import train_test_split
muon_hits_x_train,muon_hits_x_test,pt_inverse_labels_train, pt_inverse_labels_test = train_test_split(muon_hits_df_x,pt_inverse_labels,test_size=0.1, random_state=42)
Next, let's create the pipeline for preprocessing the data for regression:
def preprocess_pipeline_reg(df_x, df_y):
    # delete columns with more than 70% missing values
    df_x = delete_columns(df_x, 70)
    # impute the remaining missing values with the column means
    df_x = replace_missing_with_mean(df_x)
    # standardize the data
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    x = scaler.fit_transform(df_x)
    return x, df_y.values
muon_hits_x_train_values,pt_inverse_labels_train_values=preprocess_pipeline_reg(muon_hits_x_train, pt_inverse_labels_train)
muon_hits_x_test_values,pt_inverse_labels_test_values=preprocess_pipeline_reg(muon_hits_x_test, pt_inverse_labels_test)
Building the Model¶
Let's first build and compile the model:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Dropout
reg_model=Sequential()
reg_model.add(Dense(512,input_dim=(muon_hits_x_train_values.shape[1]),activation='relu'))
reg_model.add(Dense(256,activation='relu'))
reg_model.add(Dropout(0.1))
reg_model.add(Dense(128,activation='relu'))
reg_model.add(Dense(64,activation='relu'))
reg_model.add(Dropout(0.1))
reg_model.add(Dense(32,activation='relu'))
reg_model.add(Dropout(0.1))
reg_model.add(Dense(1,activation='linear'))
reg_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),loss='mean_squared_error')
reg_model.summary()
Next, we will set some callbacks for the model, including a Checkpoint callback and an Early Stopping callback that both monitor the model's validation loss (mean squared error):
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
filepath="reg_weights2-improvement-{epoch:02d}-{val_loss:.6f}.hdf5"
checkpoint1 = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
checkpoint2= EarlyStopping( monitor='val_loss', patience=3)
callbacks_list = [checkpoint1, checkpoint2]
reg_history = reg_model.fit(muon_hits_x_train_values, pt_inverse_labels_train_values,
                            validation_split=0.25, epochs=200, batch_size=2000,
                            callbacks=callbacks_list)
Next, let's plot the evolution of the loss over the epochs of training:
# list all data in history
print(reg_history.history.keys())
# summarize history for loss
plt.plot(reg_history.history['loss'])
plt.plot(reg_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.savefig("reg_loss.png")
plt.show()
Model Testing¶
Let's evaluate the model performance over the train and test set:
reg_model.load_weights("/content/reg_weights2-improvement-81-0.002230.hdf5")
from sklearn.metrics import mean_squared_error
y_train_pred_reg=reg_model.predict(muon_hits_x_train_values)
print("train root mean squared error: ",mean_squared_error(pt_inverse_labels_train_values,y_train_pred_reg,squared=False))
y_test_pred_reg=reg_model.predict(muon_hits_x_test_values)
print("test root mean squared error: ",mean_squared_error(pt_inverse_labels_test_values,y_test_pred_reg,squared=False))
That is impressive! The root mean squared error on the test data is relatively small in comparison to the spread of the data!
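One way to make "relatively small in comparison to the data" precise is to compare the RMSE to the standard deviation of the labels, which is exactly the RMSE of a trivial model that always predicts the mean. A sketch with hypothetical numbers; the actual q/pt labels and predictions would replace y and y_pred:

```python
import numpy as np

# Hypothetical labels and predictions standing in for q/pt
rng = np.random.default_rng(1)
y = rng.normal(scale=0.2, size=1000)
y_pred = y + rng.normal(scale=0.02, size=1000)  # a model with small errors

rmse = np.sqrt(np.mean((y - y_pred) ** 2))
baseline = y.std()  # RMSE of always predicting the mean

# The model is only useful if its RMSE is well below the baseline
print(rmse, baseline)
```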
Future Directions¶
- We discovered in this project that the classifier model was not very good at detecting minority classes. Therefore, we need to look into more techniques to make the model less biased (such as upsampling the minority classes or downsampling the majority class).
- We can further improve the performance of our regression model by hyperparameter tuning through grid search.
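The upsampling idea in the first point can be sketched with Scikit-Learn's resample utility; this is a toy illustration with hypothetical data, not the notebook's actual training set:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced training set: 90 majority vs 10 minority samples
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_min, y_min = X[y == 1], y[y == 1]
# Draw minority samples with replacement until they match the majority count
X_up, y_up = resample(X_min, y_min, replace=True, n_samples=90, random_state=42)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
print(np.unique(y_bal, return_counts=True))  # both classes now have 90 samples
```

Unlike class weights, which reweight the loss, upsampling changes the data itself; it should be applied to the train split only, after the train/test split, to avoid leaking duplicated samples into the test set.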