Train an object tracking network

hello everyboby,
I want to write as single object tracking with deep learning. but I have some problems.
ٌWhen I want to train a classification model, I have several images in different classes and assign labels to them according to the class name . but in tracking , we have some videos with their BB, so I dont know how can use this data for training,
after searching, I found that, I can train a object detection and then use it in the model, but when I want use LSTM network, to get temporal dependencies between frames of video, how can I train it?