Detecting muons is an important task in the CMS experiment at CERN (the European Organization for Nuclear Research). Muons appear in the final states of many of the physics processes studied in depth by the CMS physics program. For example, one way to study the famous Higgs boson is through the decay channel in which it decays into four muons.
Since muons can traverse several meters of iron with only a small energy loss, they are hardly stopped by any layer of the CMS detector. The muon detectors are therefore the outermost ones, far from the interaction point. In the current Level-1 trigger, the Barrel Muon Track Finder uses lookup tables to estimate the transverse momentum of the muons. The momentum assignment relies on two spatial coordinates: the pseudo-rapidity eta, derived from the polar angle theta between the particle's direction and the beam (z) axis via eta = -ln[tan(theta/2)], and the azimuthal angle phi, measured around the beam axis in the plane transverse to it. (Source: CMS Level-1 Trigger Muon Momentum assignment with Machine Learning - Tommaso Diotalevi)

Figure 1: Adopted Coordinate System
Any detection of a muon at this level is called a trigger.
Problem Description
Much more sophisticated techniques are needed to improve the precision of the momentum measurement and thereby reduce the trigger rate. Since the prompt muon spectrum falls exponentially, even a small improvement in precision can significantly reduce the number of triggers, in particular by reducing the number of low-momentum muons misclassified as high-momentum muons.
And that's where machine learning comes into play. By applying smarter algorithms such as neural networks, we can improve the precision with which the trigger system detects muons and assigns their transverse momenta. (Source: CMS Level-1 Trigger Muon Momentum assignment with Machine Learning - Tommaso Diotalevi)
In this article, we will use several deep learning approaches to predict muon momentum. We will use Monte Carlo simulated data from the Cathode Strip Chambers (CSC) of the CMS experiment; the dataset contains more than 3 million muon events generated with Pythia.
You can also check the Jupyter notebook we prepared for preprocessing and training the models here.
Exploring the Dataset
The data we retrieved is organized as follows:
There are 87 features: 84 related to the detectors and 3 serving as path variables
There are 12 detector planes in the muon system and each contributes 7 features (12*7=84)
The features are organized in the following table:
Column Range | 0-11 | 12-23 | 24-35 | 36-47 | 48-59 | 60-71 | 72-83 | 84-86 |
---|---|---|---|---|---|---|---|---|
features | Phi coordinate | Theta coordinate | Bending angle | Time info | Ring number | Front/rear hit | Mask | X_road |
Table 1: Features and their indices in the dataset
In addition, there are three more columns:
Column Range | 87 | 88 | 89 |
---|---|---|---|
features | Phi angle | Eta angle | q/pt (Muon Momentum) - target label |
Table 2: Additional Features and their indices in the dataset
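To make this layout concrete, here is a small illustrative sketch of how individual feature groups can be sliced out of the raw detector array by their column ranges (it assumes the .npz file has already been loaded into data, as done in the next section):
#illustrative only: slice feature groups out of the raw 87-column detector array
raw = data['variables']        #shape: (number of events, 87)
phi_planes = raw[:, 0:12]      #Phi coordinate of the 12 detector planes
bend_planes = raw[:, 24:36]    #Bending angle of the 12 detector planes
x_road = raw[:, 84:87]         #the 3 X_road path variables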
The muon system contains multiple detector stations, as shown in the figure below, and the detector planes are organized into chambers within these stations.

Figure 2: Schematic view of a CMS quadrant (Source: CERN Document Server)
However, we only care about the muon hits in the CSC chambers, and therefore only about the features related to those chambers.
Let us first delete those unneeded features:
import numpy as np
import pandas as pd

def delete_not_csc(data):
    #select the detector features from the 'variables' array
    muon_hits = data['variables']
    #collect the indices of the features to drop
    indices_to_del = []
    for i in range(muon_hits.shape[1]):
        #within each group of 12 plane columns, drop the planes that do not belong to the CSC
        if i % 12 >= 5 and i < 84:
            indices_to_del.append(i)
    #delete the features with matching indices
    return np.delete(muon_hits, indices_to_del, axis=1)

#load the data into a numpy array archive
data = np.load("histos_tba.npz")
#delete the unneeded features
muon_hits = delete_not_csc(data)
Now that we have deleted those features, we can fit the remaining ones into a compact Pandas data frame for easier data analysis later on.
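The exact construction lives in the notebook, but a minimal sketch of how such a data frame could be assembled looks like this (it assumes 5 CSC planes remain per feature group after delete_not_csc and names the columns to match the tables and statistics shown here):
#build readable column names: 7 feature groups x 5 remaining planes, plus the 3 X Road variables
feature_groups = ["Phi angle", "Theta angle", "Bend angle", "Time", "Ring", "Front/Rear", "Mask"]
columns = [f"{group}{plane}" for group in feature_groups for plane in range(1, 6)]
columns += [f"X Road{i}" for i in range(1, 4)]
muon_hits_df = pd.DataFrame(muon_hits, columns=columns)
The phi, eta and q/pt columns from Table 2 are then appended to this data frame as well, since q/pt is used below as the target label.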
Exploratory Data Analysis
Let's start by taking a look at the data statistics using Pandas' describe method:
 | Phi angle1 | Theta angle1 | Bend angle1 | Time1 | Ring1 | Front/Rear1 |
---|---|---|---|---|---|---|
count | 870666.0 | 870666.0 | 870666.0 | 870666.0 | 870666.0 | 870666.0 |
mean | 2942.738281 | 62.888813 | -0.075262 | -0.029150 | 2.0 | 0.465420 |
std | 1061.839233 | 10.498766 | 19.846676 | 0.167572 | 0.0 | 0.499028 |
min | 690.000000 | 46.000000 | -81.000000 | -1.000000 | 2.0 | 0.000000 |
25% | 2032.000000 | 54.000000 | -14.000000 | 0.000000 | 2.0 | 0.000000 |
50% | 2937.000000 | 62.000000 | 0.000000 | 0.000000 | 2.0 | 0.000000 |
75% | 3853.000000 | 71.000000 | 14.000000 | 0.000000 | 2.0 | 1.000000 |
max | 4950.000000 | 88.000000 | 70.000000 | 0.000000 | 2.0 | 1.000000 |
Table 3: Statistics about features with missing values
A quick look at the count row (870666 non-null values, well below the total number of events) reveals that features such as Phi angle1, Theta angle1 and Bend angle1 have missing values. Let's find the exact percentage of null values for each feature:
import matplotlib.pyplot as plt

#percentage of missing values per feature, sorted from highest to lowest
null_perc = (muon_hits_df.isnull().sum() / muon_hits_df.shape[0] * 100).sort_values(ascending=False)
#create a figure object to set the figure size and the axes' labels
fig = plt.figure(figsize=(10, 5))
ax = fig.gca()
ax.set_xlabel("feature names")
ax.set_ylabel("missing values percentage")
#generate the bar graph
null_perc.plot.bar(ax=ax)
plt.show()

Figure 3: Missing values distribution
We notice that the Front/Rear1, Phi angle1, Theta angle1, Bend angle1, Time1 and Ring1 features have more than 70% missing values! We will remove those features later on; for the other features, which have less than 25% missing values, we will fill the gaps with each column's mean.
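The preprocessing pipeline below calls a small helper, delete_columns, to drop features whose missing-value percentage exceeds a given threshold. Its definition is in the notebook; a minimal sketch of what it might look like:
def delete_columns(df, threshold):
    #percentage of missing values per column
    null_perc = df.isnull().sum() / df.shape[0] * 100
    #keep only the columns whose missing-value percentage does not exceed the threshold
    return df.loc[:, null_perc <= threshold]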
Next, we will visualize the data distribution of all features.
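The histograms in Figure 4 can be produced with Pandas' hist method; a minimal sketch (the exact figure size and binning in the notebook may differ):
#plot a histogram for every column of the data frame
muon_hits_df.hist(figsize=(20, 15), bins=50)
plt.show()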

Figure 4: Selected features histograms
As the figure above shows, some features such as Theta angle1, Theta angle2, Theta angle3 and X Road1 have skewed, unscaled distributions and need to be standardized (rescaled to zero mean and unit variance) if we want to improve the learning speed of the model.
Data Preprocessing for Training
Framing the Problem as a Classification Task
Instead of applying regression directly to predict the muon momentum values (q/pt), let's first try to group the momenta into 4 classes: 0-10 GeV, 10-30 GeV, 30-100 GeV and >100 GeV, and hence frame the problem as a classification task.
In order to group the momenta, we will use the absolute value of the reciprocal of the target feature, |pt/q|, which is simply the transverse momentum pt since the muon charge q is ±1.

Figure 5: The data distribution of the target feature
As shown in the figure, there are clearly groups of distinct magnitudes. We can use the cut method from Pandas to bin the muon momenta:
#generate the p_T labels from the q/pt target feature
pt_inverse_labels = muon_hits_df["q/pt"]
pt_labels = abs(1 / pt_inverse_labels)
#generate the interval bins used to group the target feature
bins = pd.IntervalIndex.from_tuples([(0, 10), (10, 30), (30, 100), (100, pt_labels.max() + 1)])
#use the cut function to map values within the same range to the same interval bin
org_pt_labels_groups = pd.cut(pt_labels, bins)
#check the distribution of the newly generated groups in a bar graph
target_counts = org_pt_labels_groups.value_counts()
target_counts.plot.bar()
plt.show()

Figure 6: The data distribution of the target classes
Now we have successfully grouped the labels!
We can observe that the classes are imbalanced, so we will need to compensate for this when we train our neural network to avoid any bias toward the majority class.
Splitting the Dataset into Train and Test sets
Next, let's split the data into train and test sets; we will use a 90%-10% split.
It is important to make the test set representative of the target class distribution in the original dataset. Therefore, instead of splitting the dataset randomly, we will use Scikit-Learn's StratifiedShuffleSplit so that both the train set and the test set preserve the original class distribution.
from sklearn.model_selection import StratifiedShuffleSplit

#drop the target feature from the dataset
muon_hits_df_x = muon_hits_df.drop("q/pt", axis=1)
#create the StratifiedShuffleSplit object
split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
#split the original dataset into train and test sets
for train_index, test_index in split.split(muon_hits_df_x, org_pt_labels_groups):
    #use the train indices to get the train set
    muon_hits_x_train_set = muon_hits_df_x.loc[train_index]
    pt_labels_groups_y_train_set = org_pt_labels_groups[train_index]
    #use the test indices to get the test set
    muon_hits_x_test_set = muon_hits_df_x.loc[test_index]
    pt_labels_groups_y_test_set = org_pt_labels_groups[test_index]
Let's check the class distributions (as percentages) in the newly generated sets:
Classes | Original | Train | Test |
---|---|---|---|
(0.0, 10.0] | 77.201062 | 77.201058 | 77.201094 |
(10.0, 30.0] | 15.207951 | 15.207942 | 15.208031 |
(30.0, 100.0] | 5.331871 | 5.331862 | 5.331948 |
(100.0, 6990.34] | 2.259117 | 2.259138 | 2.258927 |
Table 4: Comparing the classes distribution between the original, train and test sets
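The percentages in Table 4 can be reproduced with value_counts; a minimal sketch, using the variable names defined above:
#compare the class proportions (in %) of the original, train and test label sets
comparison = pd.DataFrame({
    "Original": org_pt_labels_groups.value_counts(normalize=True) * 100,
    "Train": pt_labels_groups_y_train_set.value_counts(normalize=True) * 100,
    "Test": pt_labels_groups_y_test_set.value_counts(normalize=True) * 100,
})
print(comparison.sort_index())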
As Table 4 shows, both the train set and the test set are representative of the original dataset. We can now safely proceed to the final preprocessing step: preparing the pipeline.
Preparing the Preprocessing Pipeline
This pipeline will perform the following tasks:
Delete the columns with more than 70% missing values
Impute the remaining missing values with each column's mean using Scikit-Learn's SimpleImputer
Standardize the data with Scikit-Learn's StandardScaler
One-hot encode the target feature into four binary columns
Here is the full pipeline code:
def preprocess_pipeline(df_x, df_y):
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
    #delete the columns with more than 70% missing values
    #(a possible implementation of delete_columns was sketched earlier)
    df_x = delete_columns(df_x, 70)
    #impute the remaining missing values with each column's mean
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputed = imputer.fit_transform(df_x)
    df_x = pd.DataFrame(imputed, columns=df_x.columns)
    #standardize the data
    scaler = StandardScaler()
    x = scaler.fit_transform(df_x)
    #convert the interval labels into integer classes
    encoder = LabelEncoder()
    encoded_y = encoder.fit_transform(df_y)
    print(encoder.classes_)
    #encode the integer classes into 4 one-hot encoded features
    #(on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False)
    ohe = OneHotEncoder(sparse=False)
    ohe_y = ohe.fit_transform(encoded_y.reshape(-1, 1))
    print(ohe.categories_)
    return x, ohe_y, encoded_y
Building the Neural Network
We will use TensorFlow's Keras API to build our neural network. The network will have 5 hidden layers with 512, 256, 128, 64 and 32 neurons respectively. We also add dropout layers to reduce overfitting. The output layer has 4 neurons (one per class) with a softmax activation, so each output can be read as the probability of its class.
Since this is a classification task, our loss function of choice is categorical cross-entropy, and we will use Adam as the optimizer.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

#build the model in the Sequential style
model = Sequential()
model.add(Dense(512, input_dim=muon_hits_x_train_values.shape[1], activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.1))
#4 output neurons to match the 4 classes
model.add(Dense(4, activation='softmax'))
#compile with the Adam optimizer and categorical cross-entropy loss, monitoring accuracy
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Let's set some callbacks for the model: a ModelCheckpoint callback and an EarlyStopping callback, both monitoring the validation accuracy. We will use a 25% validation split, a batch size of 500, and train for up to 200 epochs. Now we can finally start training!
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from sklearn.utils import class_weight

#file path template for saving the best weights
filepath = "classifier_weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
#set up the callbacks
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
early_stopping = EarlyStopping(monitor='val_accuracy', patience=3)
callbacks_list = [checkpoint, early_stopping]
#generate class weights to balance the target classes distribution
#(Keras expects a dict mapping class index to weight)
class_weights = class_weight.compute_class_weight('balanced',
                                                  classes=np.unique(y_encoded_train),
                                                  y=y_encoded_train)
class_weights = dict(enumerate(class_weights))
#start training the model
history = model.fit(muon_hits_x_train_values, pt_labels_ohe_y_train,
                    validation_split=0.25,
                    epochs=200,
                    batch_size=500,
                    callbacks=callbacks_list,
                    class_weight=class_weights)
Now that we have finished training our model, let's take a look at the evolution of its performance over the training epochs.
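The curves shown in Figure 7 can be reproduced from the History object returned by model.fit; a minimal sketch:
#plot the training and validation accuracy over the epochs
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()
#the loss curves can be plotted the same way from history.history['loss'] and history.history['val_loss']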



Figure 7: Evolution of classifier metrics over training epochs
It looks like we achieved 84% accuracy on the validation set. That's impressive! But if there is one thing I have learned by heart from machine learning, it is to never rely on accuracy alone to judge a model's performance. We need to dig deeper to understand the situation better.
Testing the Model
Let's examine the performance of the model on the train set and the test set using Scikit-Learn's classification report, which gathers several important classification metrics: precision, recall and f1-score.
from sklearn.metrics import classification_report

#generate predictions for the train data
y_train_pred = model.predict(muon_hits_x_train_values)
#select the class with the highest probability
y_train_pred_best = np.argmax(y_train_pred, axis=1)
#generate the classification report
train_report = classification_report(y_encoded_train, y_train_pred_best)
#generate predictions for the test data
y_test_pred = model.predict(muon_hits_x_test_values)
#select the class with the highest probability
y_test_pred_best = np.argmax(y_test_pred, axis=1)
#generate the classification report
test_report = classification_report(y_encoded_test, y_test_pred_best)
print("Train report \n", train_report)
print("Test report \n", test_report)
Train Report | Precision | Recall | f1-score |
---|---|---|---|
0 | 0.99 | 0.90 | 0.94 |
1 | 0.65 | 0.59 | 0.62 |
2 | 0.33 | 0.73 | 0.46 |
3 | 0.36 | 0.66 | 0.47 |
Macro Average | 0.58 | 0.72 | 0.62 |
Weighted Average | 0.89 | 0.84 | 0.86 |
Table 5: Train data evaluation report
Test Report | Precision | Recall | f1-score |
---|---|---|---|
0 | 0.99 | 0.90 | 0.94 |
1 | 0.64 | 0.58 | 0.61 |
2 | 0.32 | 0.70 | 0.44 |
3 | 0.33 | 0.61 | 0.43 |
Macro Average | 0.57 | 0.70 | 0.61 |
Weighted Average | 0.89 | 0.84 | 0.85 |
Table 6: Test data evaluation report
It seems that adding class weights helped reduce the bias toward the majority class, and we reached a weighted f1-score of 0.85 on the test data! The per-class scores, however, show that the minority classes are still the hardest to classify correctly.
Experimenting with Regression
Instead of classifying the target label into four groups, let us now try to directly predict the q/pT value of each muon, framing the problem as a regression task.
We only need to change the last layer of the model, replacing the 4 softmax neurons with a single output neuron with a linear activation function.
Since this is a regression task, we will choose mean squared error as our loss function and, as in the classification task, Adam as the optimizer.
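A minimal sketch of the corresponding model definition (identical to the classifier above except for the output layer and the loss):
#regression model: same hidden layers as the classifier, single linear output
reg_model = Sequential()
reg_model.add(Dense(512, input_dim=muon_hits_x_train_values.shape[1], activation='relu'))
reg_model.add(Dense(256, activation='relu'))
reg_model.add(Dropout(0.1))
reg_model.add(Dense(128, activation='relu'))
reg_model.add(Dense(64, activation='relu'))
reg_model.add(Dropout(0.1))
reg_model.add(Dense(32, activation='relu'))
reg_model.add(Dropout(0.1))
#one output neuron with a linear activation predicts the q/pt value
reg_model.add(Dense(1, activation='linear'))
#compile with the Adam optimizer and mean squared error loss
reg_model.compile(optimizer='adam', loss='mse')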
Let's take a look at the evolution of the regression model's performance over the training epochs:

Figure 8: Evolution of the regression model's train and validation loss over training epochs
from sklearn.metrics import mean_squared_error

#calculate the root mean squared error for the train data
y_train_pred_reg = reg_model.predict(muon_hits_x_train_values)
print("train root mean squared error: ",
      mean_squared_error(pt_inverse_labels_train_values, y_train_pred_reg, squared=False))
#calculate the root mean squared error for the test data
y_test_pred_reg = reg_model.predict(muon_hits_x_test_values)
print("test root mean squared error: ",
      mean_squared_error(pt_inverse_labels_test_values, y_test_pred_reg, squared=False))
The root mean squared error is 0.046 on the train data and 0.047 on the test data. Not bad at all! These values seem reasonably small compared to the spread of the target, but there is definitely room for improvement.
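One quick way to put these numbers in context is to compare the RMSE with the spread of the target itself; a small sanity-check sketch, reusing the variables above:
#compare the test RMSE with the standard deviation of the q/pt target values
test_rmse = mean_squared_error(pt_inverse_labels_test_values, y_test_pred_reg, squared=False)
target_std = np.std(pt_inverse_labels_test_values)
print("test RMSE as a fraction of the target standard deviation:", test_rmse / target_std)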
Future Directions
We discovered that the classifier model was not very good at detecting the minority classes. We therefore need to look into more techniques that make the model less biased, such as upsampling the minority classes or downsampling the majority class; a small sketch of the former follows.
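For illustration, here is a minimal sketch of upsampling one minority class with Scikit-Learn's resample utility (the class index and target size are placeholders, not tuned values):
from sklearn.utils import resample

#hypothetical example: upsample the events of minority class 3 in the training set
minority_mask = (y_encoded_train == 3)
minority_x = muon_hits_x_train_values[minority_mask]
minority_y = y_encoded_train[minority_mask]
#draw samples with replacement until the class reaches the desired size
upsampled_x, upsampled_y = resample(minority_x, minority_y,
                                    replace=True, n_samples=200000, random_state=42)
#append the upsampled events to the training set
x_balanced = np.concatenate([muon_hits_x_train_values, upsampled_x])
y_balanced = np.concatenate([y_encoded_train, upsampled_y])
#(the one-hot encoded labels would need to be regenerated from y_balanced before training)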
We can also further improve the performance of our regression model through hyperparameter tuning, for example with a grid search; a small sketch follows.
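As an illustration, here is a minimal manual grid search over two hyperparameters, reusing the Keras imports and training variables defined above (the candidate values are arbitrary placeholders):
#hypothetical example: try a few combinations of dropout rate and batch size
best_loss, best_params = float("inf"), None
for dropout_rate in [0.1, 0.2, 0.3]:
    for batch_size in [256, 500, 1024]:
        candidate = Sequential([
            Dense(512, input_dim=muon_hits_x_train_values.shape[1], activation='relu'),
            Dropout(dropout_rate),
            Dense(128, activation='relu'),
            Dense(1, activation='linear'),
        ])
        candidate.compile(optimizer='adam', loss='mse')
        hist = candidate.fit(muon_hits_x_train_values, pt_inverse_labels_train_values,
                             validation_split=0.25, epochs=20,
                             batch_size=batch_size, verbose=0)
        val_loss = min(hist.history['val_loss'])
        if val_loss < best_loss:
            best_loss, best_params = val_loss, (dropout_rate, batch_size)
print("best validation loss:", best_loss, "with (dropout rate, batch size):", best_params)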
Final Thoughts
In this project, we tried several deep learning approaches to predict muon momentum. We first investigated the properties of the data and preprocessed it for training. We then trained a classifier to predict the muon momentum class and evaluated its performance on new data using multiple classification metrics. Next, we trained a regression model to predict the muon momentum values instead of their classes. Both approaches turned out to be successful, and we plan to incorporate more advanced techniques to reduce model bias and further improve performance.