The detection of muons is an important task in the CMS experiment at CERN (European Organization for Nuclear Research). Muons are usually expected to be produced by many physical experiments studied in depth at the CMS physics program at CERN. For example, one way to study the famous Higgs Boson particle is through a decay channel where the Higgs Boson decays into four muons.

Since muons can penetrate several meters of iron with no significant energy loss, they are hardly stopped by any layer of the CMS system. Therefore, the muon detectors are the outermost ones, far from the interaction point. The current muon transverse momentum detector, the Barrel Muon Track Finder, uses Lookup Tables to estimate the transverse momentum of the muons. The latter is measured using the pseudo-rapidity eta, a spatial quantity related to the angle at which the charged particles emerge from these collisions, and the azimuth angle phi, the angle that the collision paths make with the z-axis. (Source: CMS Level-1 Trigger Muon Momentum assignment with Machine Learning - Tommaso Diotalevi)

muon graph

Figure 1: Adopted Coordinate System

Any detection of the muon at this level is called a trigger.

Problem Description

Much more sophisticated techniques are needed to improve the precision of the momentum detection to reduce the trigger rates. Since the prompt muon spectrum follows an exponential distribution, any small improvement in the precision might significantly reduce the number of triggers, especially by reducing the number of low momentum muons misclassified as high momentum muons.

And that’s where machine learning comes into place. By applying smarter algorithms such as Neural Networks, we can improve the precision of the trigger systems to accurately detect the muons and their corresponding transverse momenta. (Source: CMS Level-1 Trigger Muon Momentum assignment with Machine Learning - Tommaso Diotalevi)

In the following article, we will try to use several Deep Learning approaches to accurately predict muon momentum. We will use Monte-Carlo simulated data from the Cathode Strip Chambers (CSC) at the CMS experiment. The dataset contains more than 3 million muon events generated using Pythia.

You can also check the Jupyter notebook we prepared for preprocessing and training the models here.

Exploring the Dataset

The data we retrieved is organized as follows:

  • There are 87 features: 84 related to the detectors and 3 serving as path variables

  • There are 12 detector planes inside the CSC and each has 7 features (12*7=84)

    The features are organized in the following table:

    Column Range0-1112-2324-3536-4748-5960-7172-8384-86
    featuresPhi coordinateTheta coordinateBending angleTime infoRing numberFront/rear hitMaskX_road

Table 1: Features and their indices in the dataset

In addition to other features:

Column Range878889
featuresPhi angleEta angleq/pt (Muon Momentum) - target label

Table 2: Additional Features and their indices in the dataset

Inside the muon chamber, there are multiple detector stations as it is shown in the figure below. These detector planes are organized in specific chambers.

muon chambers

Figure 2: Schematic view of a CMS quadrant (Source: CERN Document Server)

However, we only care about the muon hits in the CSC chambers and therefore features related to the CSC chambers only.

Let us first delete those unneeded features:

import numpy as np
import pandas as pd
def delete_not_csc(data):
#select the data from the variables list
muon_hits=data['variables']
#define the list of features indices to drop
indices_to_del=[]
for i in range(muon_hits.shape[1]):
if(i%12>=5 and i<84):
indices_to_del.append(i)
#delete the features with matching indices
return np.delete(muon_hits,indices_to_del,axis=1)
#load the data into a numpy array
data=np.load("histos_tba.npz")
#delete unneeded features
muon_hits=delete_not_csc(data)

Now that we deleted those features, we can next fit all the features into a compact Pandas data frame for easier data analysis later on.

Exploratory Data Analysis

Let’s start by taking a look on the data statistics using Pandas’ describe method:

Phi angle1Theta angle1Bend angle1Time1Ring1Front/Rear1
count870666.0870666.0870666.0870666.0870666.0870666.0
mean2942.73828162.888813-0.075262-0.0291502.00.465420
std1061.83923310.49876619.8466760.1675720.00.499028
min690.00000046.000000-81.000000-1.0000002.00.000000
25%2032.00000054.000000-14.0000000.0000002.00.000000
50%2937.00000062.0000000.0000000.0000002.00.000000
75%3853.00000071.00000014.0000000.0000002.01.000000
max4950.00000088.00000070.0000000.0000002.01.000000

Table 3: Statistics about features with missing values

A quick look reveals that some features such as Phi angle1, Theta angle1 and Bend angle1 have some missing values. Let’s find the exact percentage of null values for each feature:

import matplotlib.pyplot as plt
#generate the percentages of missing values sorted from the highest to lowest
null_perc=((muon_hits_df.isnull().
sum()/muon_hits_df.
shape[0]
)*100).sort_values(ascending=False)
#create a figure object to specify the figure size and axes' labels
fig = plt.figure(figsize = (10,5))
ax = fig.gca()
ax.set_xlabel("features names")
ax.set_ylabel("missing values percentage")
#generate the bar graph
null_perc.plot.bar(ax=ax)
plt.show()
missing values hist

Figure 3: Missing values distribution

We notice that Front/Rear1, Phi angle1, Theta angle1, Bend angle1, Time1 and Ring1 features have more than 70% missing values! We need to remove those features later on, and for the other features with less than 25% of missing values, we will fill the missing values with each column’s mean.

Next, we will visualize the data distribution of all features.

import matplotlib.pyplot as plt
#generate the percentages of missing values sorted from the highest to lowest
null_perc=((muon_hits_df.isnull().
sum()/muon_hits_df.
shape[0]
)*100).sort_values(ascending=False)
#create a figure object to specify the figure size and axes' labels
fig = plt.figure(figsize = (10,5))
ax = fig.gca()
ax.set_xlabel("features names")
ax.set_ylabel("missing values percentage")
#generate the bar graph
null_perc.plot.bar(ax=ax)
plt.show()
features hists

Figure 4: Selected features histograms

We notice as shown in the figure above that some features such as Theta angle1, Theta angle2, Theta angle3 and X Road1 need to be transformed into Normal distributions (using Standardization) if we want to improve the learning speed of the model.

Data Preprocessing for Training

Framing the Problem as a Classification Task

Instead of trying to apply regression to predict the muon momenta values (q/pt), let’s first try to cluster the momenta into 4 classes: 0-10 GeV, 10-30 GeV, 30-100 GeV and >100 GeV, and hence frame the problem as a classification task.

In order to group the momenta, we will use the absolute value of the reciprocal of the target feature (pt/q).

p_t hist

Figure 5: The data distribution of the target feature

As shown in figure, it is clear that there are distinct groups of different magnitude. We can use the cut method from Pandas to cluster the muon momenta:

#generate p_T feature
pt_inverse_labels=muon_hits_df["q/pt"]
pt_labels=abs((1/pt_inverse_labels))
#generate interval bins to group the target feature
bins = pd.IntervalIndex.from_tuples([(0, 10), (10, 30), (30, 100),(100,pt_labels.max()+1)])
#use the cut function to group values within the same range to the same interval bin
org_pt_labels_groups=pd.cut(pt_labels, bins)
#check the distribution of the newly generated groups in bar graph
target_counts=org_pt_labels_groups.value_counts()
target_counts.plot.bar()
plt.show()

classes bar

Figure 6: The data distribution of the target classes

Now we have successfully grouped the labels!

We can observe that the classes are not balanced, therefore we need to balance those classes whenever we start training our Neural Network to avoid any bias.

Splitting the Dataset into Train and Test sets

Next, let’s separate the data into train and test data, we will use a 90%-10% split.

It is important to make the test set representative of the target classes distribution in the original dataset. Therefore, instead of randomly splitting our original dataset, we will use the StratifiedShuffleSplit method in Scikit-Learn to make both the train data and test data representative of the original distribution of classes.

import sklearn
from sklearn.model_selection import StratifiedShuffleSplit
#drop the target feature from the dataset
muon_hits_df_x=muon_hits_df.drop("q/pt",axis=1)
#create the StratifiedShuffleSplit object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
#split the original dataset into train and test sets
for train_index, test_index in split.split(muon_hits_df_x, org_pt_labels_groups):
#use train indices to get the train set
muon_hits_x_train_set = muon_hits_df_x.loc[train_index]
pt_labels_groups_y_train_set=org_pt_labels_groups[train_index]
#use test indices to get the test set
muon_hits_x_test_set = muon_hits_df_x.loc[test_index]
pt_labels_groups_y_test_set=org_pt_labels_groups[test_index]
view raw stratify.py hosted with ❤ by GitHub

Let’s check the classes distribution in the newly generated sets:

ClassesOriginalTrainTest
(0.0, 10.0]77.20106277.20105877.201094
(10.0, 30.0]15.20795115.20794215.208031
(30.0, 100.0]5.3318715.3318625.331948
(100.0, 6990.34]2.2591172.2591382.258927

Table 4: Comparing the classes distribution between the original, train and test sets

As the previous table shows, both the train set and test set are representative of the original dataset. We can now safely proceed to the final step of preprocessing: preparing the pipeline.

Preparing the Preprocessing Pipeline

This pipeline will perform the following tasks:

  1. Delete columns with missing values more than 70%

  2. Impute the other columns with missing values with their corresponding means using Scikit-Learn Simple Imputer

  3. Standardize the data with Scikit-Learn Standard Scaler

  4. Encode the target feature into four one hot encoded features

Here is the pipeline full code:

def preprocess_pipeline(df_x,df_y):
#delete columns with missing values more than 70%
df_x=delete_columns(df_x,70)
#impute the other columns with missing values with their corresponding means
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed=imputer.fit_transform(df_x)
df_x=pd.DataFrame(imputed,columns=df_x.columns)
#standardize the data
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
x=scaler.fit_transform(df_x)
#convert labels into dummy variables (to be one hot encoded)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(df_y)
print(encoder.classes_)
#encode the target feature into 4 one hot encoded features
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder(sparse=False)
ohe_y=ohe.fit_transform(encoded_y.reshape(-1,1))
print(ohe.categories_)
return x,ohe_y,encoded_y

Building the Neural Network

We will use Tensorflow’s Keras library to build our Neural Network. The Neural Network will have 5 hidden layers with number of neurons in each layers is 512, 256, 128, 64 and 32. We also added dropout layers to the network to reduce overfitting. The output layer will have 4 neurons (same as the number of classes) and we will apply the softmax activation function to each output neuron to get the probability of each class.

Since it is a classification task, our loss function of choice is the categorical cross entropy. Adam will be used as optimizer.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense,Dropout
#build the model in Sequential style
model=Sequential()
model.add(Dense(512,input_dim=(muon_hits_x_train_values.shape[1]),activation='relu'))
model.add(Dense(256,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(128,activation='relu'))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.1))
# 4 output neurons to match the 4 classes
model.add(Dense(4,activation='softmax'))
#compile model with adam optimizer, categorical crossentropy loss while monitoring accuracy as metric
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

Let’s set some callbacks for the model including a Checkpoint callback and Early Stopping callback that will both monitor the model’s validation accuracy. We will use a 25% validation split, batch size of 500 and will train for 200 epochs. And now we can finally start training!

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.utils import class_weight
#set filepath for saving best weights
filepath="classifier_weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
#set up callbacks
checkpoint1 = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
checkpoint2= EarlyStopping( monitor='val_accuracy', patience=3)
callbacks_list = [checkpoint1,checkpoint1]
#generate class weights to balance the target classes distribution
class_weights = class_weight.compute_class_weight('balanced',
np.unique(y_encoded_train),
y_encoded_train)
#start training the model
history = model.fit(muon_hits_x_train_values, pt_labels_ohe_y_train,
validation_split=0.25,
epochs=200,
batch_size=500,
callbacks=callbacks_list,
class_weight=classes_weights)

Now that we finished training our model, let’s take a look at the evolution of its performance over the training epochs.

classifier_train_loss
classifier_val_loss
classifier_acc

Figure 7: Evolution of classifier metrics over training epochs

It looks like we achieved an 84% accuracy on the validation set! That’s impressive! But if there is only one thing I have learned by heart from Machine Learning, it will be to never rely on accuracy only to judge the model performance. We need to go deeper to understand the case better.

Testing the Model

Let’s examine the performance of the model on the train set and the test set using Scikit-Learn’s classification report. It includes an array of important classification metrics including precision, recall and f1-score.

from sklearn.metrics import classification_report
#generate predictions for train data
y_train_pred=model.predict(muon_hits_x_train_values)
#select class with the highest probability
y_train_pred_best=np.argmax(y_train_pred,axis=1)
#generate classification report
train_report=classification_report(y_encoded_train,y_train_pred_best)
#generate predictions for test data
y_test_pred=model.predict(muon_hits_x_test_values)
#select the class with the highest probability
y_test_pred_best=np.argmax(y_test_pred,axis=1)
#generate classification report
test_report=classification_report(y_encoded_test,y_test_pred_best)
print("Train report \n",train_report)
print("Test report \n",test_report)

Train ReportPrecisionRecallf1-score
00.990.900.94
10.650.590.62
20.330.730.46
30.360.660.47
Macro Average0.580.720.62
Weighted Average0.890.840.86

Table 5: Train data evaluation report

Test ReportPrecisionRecallf1-score
00.990.900.94
10.640.580.61
20.320.700.44
30.330.610.43
Macro Average0.570.700.61
Weighted Average0.890.840.85

Table 6: Test data evaluation report

That’s impressive! It seems that adding class weights definitely reduced bias and we reached a weighted f1-score of 0.85 on the test data!

Experimenting with Regression

Instead of classifying the target label into four group, let us try to directly predict the q/pT value for each muon momentum, therefore framing the problem as a regression task.

We only changed the last layer of the model, and replaced the 4 neurons with softmax by one output neuron with a linear activation function.

Since it is a regression task, we will choose mean square error as our loss function , and similarly to the classification task, we will use Adam as optimizer.

Let’s take a look on the evolution of the regression model’s performance over training epochs:

regressor_loss

Figure 7: Evolution of the model train and validation loss over training epochs

Let’s examine the performance of the model on the train set and the test set. We will the root of the mean square error to get a better feel of the values:
from sklearn.metrics import mean_squared_error
# calculate square root of mse for train data
y_train_pred_reg=reg_model.predict(muon_hits_x_train_values)
print("train root mean squared error: ",mean_squared_error(pt_inverse_labels_train_values,y_train_pred_reg,squared=False))
# calculate square root of mse for test data
y_test_pred_reg=reg_model.predict(muon_hits_x_test_values)
print("test root mean squared error: ",mean_squared_error(pt_inverse_labels_test_values,y_test_pred_reg,squared=False))
view raw mse_eval.py hosted with ❤ by GitHub

The root mean squared error for the train data is 0.046 and for the test data is 0.047. Not bad at all! The values seems to be relatively acceptable in comparison to the target value, but definitely there is room for improvement.

Future Directions

We discovered that the classifier model was not very good in detecting minority classes. Therefore we need to look for more techniques to make the model less biased (such as up sampling minority classes or down sampling the majority class).

We can further improve the performance of our regression model by hyperparameter tuning through grid search.

Final Thoughts

In this project, we tried several deep learning approaches to accurately predict muon momentum. At first, we investigated the data properties before preprocessing the data for training. We then trained a classifier to predict muon momentum class and evaluated the performance of the classifier on new data using multiple classification metrics. Next, we trained a regression model to predict muon momentum values instead of their classes. Both approaches turned out to be successful, and we plan to incorporate more advanced techniques to reduce model bias and improve its performance.