Training a single-object tracking model

I am researching object tracking and have run into some problems training my model. My network consists of a feature extractor, an LSTM network, and a classifier. On the first frame of each video, it needs to use the object's position from a text file, and then track the object through the following frames.

To train the tracking model, I need a set of videos, and each video comes with a txt file containing the bounding-box coordinates for every frame. This means each frame effectively needs two labels: one identifying the video it belongs to and one giving the position of the object. The network also has to be told, when the first frame of a video arrives, to use the object position from the text file.

My questions are:

1- Since the object's position must be used on the first frame of each video, how do I tell the network that a first frame is being fed in?

2- In classification or detection models, there are many images in one folder and each image has one label. In tracking, there are several folders and each folder contains many images, so each image has two labels. How do I train the model with these labels?
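To make the data layout concrete, here is a minimal sketch of one way it could be organized. It is only an illustration under assumptions: the `x,y,w,h` box format, the one-line-per-frame annotation file, and the names `Sample`, `parse_boxes`, `iter_video`, and `iter_dataset` are all hypothetical and not taken from any library. The idea it shows is that the video identity does not have to be a second label: each video becomes one ordered sequence, and the first frame is recognized by its index.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Assumed box format: (x, y, width, height). Adjust to the real annotation files.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    video_id: str     # which video (folder) the frame belongs to
    frame_index: int  # position of the frame inside that video
    is_first: bool    # True only for frame 0, where the box is given as input
    box: Box          # ground-truth box: input on frame 0, target afterwards


def parse_boxes(text: str) -> List[Box]:
    """Parse annotation text with one 'x,y,w,h' (or 'x y w h') line per frame."""
    boxes: List[Box] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x, y, w, h = (float(v) for v in line.replace(",", " ").split())
        boxes.append((x, y, w, h))
    return boxes


def iter_video(video_id: str, boxes: List[Box]) -> Iterator[Sample]:
    """Yield the frames of one video in order, flagging the first one."""
    for i, box in enumerate(boxes):
        yield Sample(video_id, i, i == 0, box)


def iter_dataset(annotations: Dict[str, str]) -> Iterator[List[Sample]]:
    """Yield one ordered sequence per video, so a video is the training unit."""
    for video_id, text in annotations.items():
        yield list(iter_video(video_id, parse_boxes(text)))


if __name__ == "__main__":
    # Hypothetical annotation contents keyed by video folder name.
    annotations = {
        "video_a": "10,20,30,40\n12,21,30,40\n",
        "video_b": "5 5 8 8\n6 5 8 8\n7 5 8 8\n",
    }
    for sequence in iter_dataset(annotations):
        for sample in sequence:
            print(sample)
```

With the data grouped this way, a training loop would iterate over sequences rather than over individual images, and the `is_first` flag marks the frame on which the given box is an input rather than a prediction target.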