Cuts do not work on features
Cutting with the built-in method cut of the DataContainer class gives the right cuts for labels but not for features. Here is a simple example (it only works for Alex, who has access to the root files):
import glob
import numpy as np
from swgo.data_loading import load_data
config = "A1"
datadir = "/home/wecapstor3/capn/mppi110h/M5/" + config
fnames_gamma = glob.glob("%s/gamma/*.root" % datadir)[:10]
data = load_data(fnames_gamma, hit_threshold=25, pmts_per_tank=2)
feat_before_cuts = data.feat_dict()["feat"] # len(feat_before_cuts) = 1892
energy_before_cuts = data.label_dict()["mc_energy"] # len(energy_before_cuts) = 1892
print("Feat length", len(feat_before_cuts), "Label length", len(energy_before_cuts))
Print returns
Feat length 1892 Label length 1892
Now let's apply some simple cuts, for example core cuts.
r_core = np.linalg.norm(data.label_dict()["core"], axis=1)
mask_core = r_core < 300
data.cut(mask_core)
feat_after_cuts = data.feat_dict()["feat"]
energy_after_cuts = data.label_dict()["mc_energy"]
print("Feat length", len(feat_after_cuts), "Label length", len(energy_after_cuts))
Print returns
Feat length 1892 Label length 1153
Here we see that the labels are cut correctly, but the length of the features remains the same. Only the length is unchanged, though: the features in data.feat are modified, which we can see here:
print(data.feat['swgo_pc'].arr['feat'][0][:1])
print(data.feat['swgo_pc'].arr['feat'][1][:1])
print(data.feat['swgo_pc'].arr['feat'][2][:1])
print(data.feat['swgo_pc'].arr['feat'][3][:1])
print(data.feat['swgo_pc'].arr['feat'][4][:1])
print(mask_core[:5])
# Entries 1 and 3 are identical, and so are entries 2, 4 and 5, following the mask
Print returns
[[0. 0.57495224 0. 0.01314261]]
[[ 0. -1.7761285 0. 0.00289979]]
[[0. 0.57495224 0. 0.01314261]]
[[ 0. -1.7761285 0. 0.00289979]]
[[ 0. -1.7761285 0. 0.00289979]]
[False True False True True]
This is probably caused by this:
def mask(self, mask):
    """
    Masks the data based on the provided mask.
    Args:
    - mask: The mask to apply.
    Returns:
    tuple: Tuple containing feature and label dicts with applied cuts.
    """
    print("DataSet ", self.name, "mask %i/%i events\n" % (mask.sum(), mask.shape[0]))
    for k, vals in self.feat.items():
        self.feat[k].arr = vals[mask]
which is the mask method of the DataContainer class. Here vals is what gets masked. But vals is of type PointCloud, and if one looks at its __getitem__ method:
def __getitem__(self, idx):
    """
    Returns a subset of the point cloud specified by the index.
    Parameters:
    - idx: The index or indices to retrieve the subset.
    Returns:
    dict: Subset of the point cloud.
    """
    if self.is_sparse is False: # then type(arr) = np.arr -> use np fancy indexing
        return {k: val[idx] for k, val in self.dict.items()}
    else:
        if type(idx) == int or type(idx) == slice:
            return {k: val[idx] for k, val in self.dict.items()}
        else:
            return {k: [val[i] for i in idx] for k, val in self.dict.items()}
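this leads to indexing only the zeroth and first entry: in the sparse branch the boolean mask is apparently not applied as a mask but iterated element by element, so each False/True ends up being used as the index 0/1. That would also explain why the length of the features stays equal to the length of the mask.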
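To make the suspected mechanism concrete, here is a minimal sketch with plain Python lists. The names events, buggy and fixed are made up for illustration and this is not the swgo code; it also assumes that the elements of a NumPy boolean mask are accepted as list indices in the same way as Python bools, which the output above suggests but which I have not checked.

```python
# Hypothetical stand-in for one sparse feature: a list with one entry per event.
events = ["ev0", "ev1", "ev2", "ev3", "ev4"]
mask = [False, True, False, True, True]

# What the last branch of __getitem__ effectively does with a boolean mask:
# every mask element is used as an integer index (False -> 0, True -> 1).
buggy = [events[i] for i in mask]

# What a cut is expected to do: keep the entries where the mask is True.
fixed = [ev for ev, keep in zip(events, mask) if keep]

print(buggy)  # ['ev0', 'ev1', 'ev0', 'ev1', 'ev1']
print(fixed)  # ['ev1', 'ev3', 'ev4']
```

Note that buggy has the same length as the mask and only ever contains the first two events, which is the pattern seen in the printout above.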
I do not really understand the construction data.feat['swgo_pc'].arr['feat']; I only came across it while looking for where the error might come from.
I think it would be nice to be able to cut data.feat_dict() directly, but its items are lists, and a list cannot be indexed with a boolean mask, at least not without looping over it. As a workaround I have used a "poor man's" cut like this:
Xs_cut = [X for (i, X) in enumerate(Xs) if mask_core[i]]
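The same workaround can probably be written with itertools.compress from the standard library, which keeps the items whose mask entry is truthy and so should not care whether the mask is a list or a NumPy boolean array. A minimal sketch: cut_list is a hypothetical helper name, not part of swgo, the length check is my own addition, and the toy Xs and mask_core only stand in for the real data.

```python
from itertools import compress


def cut_list(items, mask):
    """Keep the items whose corresponding mask entry is truthy (hypothetical helper)."""
    if len(items) != len(mask):
        raise ValueError("mask and items must have the same length")
    return list(compress(items, mask))


# Toy data: four events with a variable number of hits each.
Xs = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6], [0.7]]
mask_core = [False, True, False, True]

Xs_cut = cut_list(Xs, mask_core)
print(len(Xs_cut))  # 2
```

The explicit length check is there so that a mask that does not match the number of events fails loudly instead of silently returning the wrong events.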