Salesforce AI Introduces FOFPred: A Language-Driven Future Optical Flow Prediction Framework that Enables Improved Robot Control and Video Generation

The Salesforce AI research team presents FOFPred, a language-driven future optical flow prediction framework that connects large vision language models with diffusion transformers for dense motion forecasting in robot control and video generation settings. FOFPred takes one or more images and a natural language instruction such as ‘moving the bottle from right to left’ and predicts 4 future optical flow frames that describe how every pixel is expected to move over time.

https://arxiv.org/pdf/2601.10781

Future optical flow as a motion representation

Optical flow is the apparent per-pixel displacement between two frames. FOFPred focuses on future optical flow: predicting dense displacement fields for future frames given only current observations and text, without access to future images at inference.
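To make the definition concrete, here is a small illustrative snippet, not taken from the paper, that estimates a dense flow field between two consecutive grayscale frames with OpenCV's Farneback estimator; FOFPred itself predicts such fields for future frames rather than estimating them between observed ones.

```python
# Illustration only: dense optical flow between two observed frames.
# FOFPred predicts flow for *future* frames from current images and text.
import cv2

def dense_flow(prev_gray, next_gray):
    """Returns an (H, W, 2) array of per-pixel (dx, dy) displacements."""
    # Positional args: flow=None, pyr_scale=0.5, levels=3, winsize=15,
    # iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```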

Future optical flow is a compact, motion-only representation. It removes static appearance and keeps only pixel-level motion, so it is well suited as an intermediate state for robot control policies and as a conditioning signal for video diffusion models. Compared to predicting future RGB frames, it reduces the complexity of the output distribution and avoids modeling textures and high-frequency details that are not required for motion planning.

To plug into existing latent diffusion infrastructure, the research team encodes optical flow as RGB images. Flow magnitude and direction are mapped from polar form into HSV channels, then converted to RGB. The scaling of each channel is tuned so that consecutive flow frames are visually smooth and resemble animated graphics. A standard Flux.1 variational autoencoder then encodes and decodes these flow images.
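A minimal sketch of this kind of flow-to-RGB encoding is shown below, assuming OpenCV-style HSV conventions; the function name and the max_magnitude scale are illustrative placeholders, not the paper's exact values.

```python
# Hedged sketch: encode a dense flow field as an RGB image so it can pass
# through a standard image VAE (e.g., the Flux.1 VAE). Hue encodes direction,
# value encodes clipped magnitude; the scale is a tunable placeholder.
import numpy as np
import cv2

def flow_to_rgb(flow: np.ndarray, max_magnitude: float = 20.0) -> np.ndarray:
    """Map a (H, W, 2) flow field to an RGB uint8 image via HSV."""
    magnitude, angle = cv2.cartToPolar(
        flow[..., 0].astype(np.float32), flow[..., 1].astype(np.float32))
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (angle * 180 / np.pi / 2).astype(np.uint8)   # hue: direction
    hsv[..., 1] = 255                                          # full saturation
    hsv[..., 2] = np.clip(magnitude / max_magnitude * 255, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)
```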

Unified VLM-diffusion backbone

FOFPred uses a unified architecture that combines a frozen vision language model, a frozen VAE and a trainable diffusion transformer. The pipeline is:

  • Qwen2.5-VL is used as the vision language encoder to jointly encode the caption and visual inputs.
  • Flux.1 VAE encodes the input images and the training optical flow targets into latent tensors.
  • An OmniGen-style diffusion transformer (DiT) takes projected visual and textual features as conditional inputs and generates latent future flow sequences.

Only the DiT and small MLP projectors are trained. The Qwen2.5-VL and Flux.1 weights stay frozen, which lets the model reuse image editing pretraining and multimodal reasoning ability from prior work. Temporal modeling is added by extending the RoPE positional encoding and attention blocks from two-dimensional spatial positions to full spatio-temporal positions across input and output frame sequences. This gives full spatio-temporal attention without adding extra parameters, so the DiT can reuse OmniGen image pretraining directly.
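The sketch below illustrates one plausible way to build such spatio-temporal rotary positions by splitting the attention head dimension across the time, row, and column axes; the function names and the exact axis split are assumptions, not the paper's implementation.

```python
# Hedged sketch of extending axial 2D RoPE to spatio-temporal (t, h, w)
# positions, so the same attention blocks can attend across full frame
# sequences without any new learnable parameters.
import torch

def axis_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """RoPE rotation angles for one axis: (N,) positions -> (N, dim // 2)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * freqs[None, :]

def spatiotemporal_rope(num_frames: int, height: int, width: int, head_dim: int) -> torch.Tensor:
    """Per-token angles for every (t, h, w) position; head_dim split across 3 axes."""
    axis_dim = head_dim // 3
    grid = torch.stack(torch.meshgrid(
        torch.arange(num_frames), torch.arange(height), torch.arange(width),
        indexing="ij"), dim=-1).reshape(-1, 3)          # (T*H*W, 3) positions
    # Concatenate angles for the time, row, and column axes; queries and keys
    # are then rotated by cos/sin of these angles inside attention.
    return torch.cat([axis_angles(grid[:, i], axis_dim) for i in range(3)], dim=-1)
```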


Training on noisy web videos with relative optical flow

The core model is trained on web-scale human activity videos with paired captions. The research team uses the Something-Something V2 dataset and the EgoDex egocentric manipulation dataset to obtain around 500,000 video-caption pairs.

Training uses an end-to-end flow matching objective in latent space. Future optical flow sequences are first computed offline, then encoded by the VAE and used as targets in a flow matching diffusion loss for the DiT. During training, the method also applies classifier-free guidance on both text and visual conditions and masks some frames and viewpoints to improve robustness.
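As a rough illustration of a latent flow matching training step, the sketch below assumes a rectified-flow style objective; dit, vae, and the conditioning inputs are placeholders standing in for the trainable DiT, the frozen Flux.1 VAE, and the projected VLM features.

```python
# Hedged sketch of a latent flow matching step under a rectified-flow style
# objective; `dit` and `vae` are placeholders, not the paper's exact modules.
import torch
import torch.nn.functional as F

def flow_matching_step(dit, vae, flow_rgb_frames, cond_tokens):
    with torch.no_grad():
        z0 = vae.encode(flow_rgb_frames)            # latent targets for future flow
    noise = torch.randn_like(z0)
    t = torch.rand(z0.shape[0], device=z0.device)   # one timestep per sample
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))
    z_t = (1.0 - t_) * z0 + t_ * noise              # linear interpolation path
    target_velocity = noise - z0                    # rectified-flow velocity target
    pred_velocity = dit(z_t, t, cond_tokens)        # DiT predicts the velocity
    return F.mse_loss(pred_velocity, target_velocity)
```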

A critical contribution is the relative optical flow calculation used to build clean training targets from noisy egocentric videos. For each frame pair, the method:

  1. Computes dense optical flow with an off-the-shelf estimator.
  2. Estimates camera motion via homography using deep features.
  3. Uses projective geometry to subtract camera motion and obtain object-centric relative flow vectors.
  4. Filters frame pairs by selecting those where the top k percent of flow magnitudes exceed a threshold, which focuses training on segments with meaningful motion.

These steps are run offline at lower resolution for efficiency, then recomputed at the original resolution for the final targets. The ablation study shows that static frame targets or raw flow without camera motion removal harm downstream performance, while disentangled relative flow targets give the best results.
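The following snippet sketches the camera-motion subtraction in steps 2 and 3 above, assuming an OpenCV homography fit on matched background keypoints; the specific estimators and thresholds are assumptions rather than the paper's settings.

```python
# Hedged sketch: subtract camera-induced flow via a homography fit on matched
# background keypoints, leaving object-centric relative flow.
import numpy as np
import cv2

def relative_flow(raw_flow: np.ndarray, pts_prev: np.ndarray, pts_next: np.ndarray) -> np.ndarray:
    """raw_flow: (H, W, 2) dense flow; pts_*: matched background keypoints (N, 2)."""
    H_mat, _ = cv2.findHomography(pts_prev, pts_next, cv2.RANSAC, 3.0)
    h, w = raw_flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(coords, H_mat).reshape(h, w, 2)
    camera_flow = warped - np.stack([xs, ys], axis=-1)   # flow induced by ego-motion
    return raw_flow - camera_flow                        # object-centric relative flow
```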


Language-driven robot manipulation

The first downstream use case is robot control. FOFPred is finetuned on robot video-caption data to predict future optical flow from both fixed and wrist-mounted cameras. On top of FOFPred, the research team attaches a diffusion policy network that takes predicted flow, text, and robot state, and outputs continuous actions. This setup follows prior diffusion policy work but uses future optical flow instead of predicted RGB frames as the core representation.
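A hypothetical sketch of such a flow-conditioned policy head is shown below; the module layout and feature dimensions are placeholders, since the paper's exact policy architecture follows prior diffusion policy designs.

```python
# Hypothetical sketch: fuse predicted flow features, language features, and
# proprioceptive state, then denoise an action with a stand-in for a diffusion
# policy network. Names and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class FlowConditionedPolicy(nn.Module):
    def __init__(self, flow_dim, text_dim, state_dim, action_dim, hidden=512):
        super().__init__()
        self.fuse = nn.Linear(flow_dim + text_dim + state_dim, hidden)
        self.denoiser = nn.Sequential(                   # stand-in for the
            nn.Linear(hidden + action_dim + 1, hidden),  # diffusion policy's
            nn.SiLU(),                                   # denoising network
            nn.Linear(hidden, action_dim),
        )

    def forward(self, flow_feat, text_feat, state, noisy_action, t):
        cond = self.fuse(torch.cat([flow_feat, text_feat, state], dim=-1))
        return self.denoiser(torch.cat([cond, noisy_action, t[:, None]], dim=-1))
```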

On the CALVIN ABCD benchmark, which evaluates long-horizon, zero-shot chains of 5 language-specified manipulation tasks, FOFPred reaches an average chain length of 4.48, compared with 4.33 for VPP and 4.44 for DreamVLA under the same protocol. FOFPred also attains a Task 5 success rate of 78.7 percent, the best among reported methods. In a low-data setting with 10 percent of CALVIN demonstrations, FOFPred still reaches a 3.43 average chain length, higher than VPP's 3.25.

On RoboTwin 2.0, a dual-arm manipulation benchmark with 5 tasks that require both arms, FOFPred attains an average success rate of 68.6 percent, versus 61.8 percent for the VPP baseline under identical training settings. FOFPred improves success on every task in the subset.


Motion-aware text-to-video generation

The second downstream task is motion control in text-to-video generation. The research team builds a two-stage pipeline by connecting FOFPred with the Go with the Flow video diffusion model. FOFPred takes an initial frame and a language description of motion, predicts a sequence of future flow frames, and interpolates them into a dense motion field. Go with the Flow then uses this motion field and the initial frame to synthesize the final video, enforcing the described motion pattern.
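One simple way to densify the predicted flow in time, assuming straightforward temporal resampling rather than the paper's exact interpolation scheme, is sketched below.

```python
# Hedged sketch: temporally interpolate the 4 predicted flow frames up to the
# frame count expected by the downstream video diffusion model.
import torch
import torch.nn.functional as F

def interpolate_flow(flow_frames: torch.Tensor, num_target_frames: int) -> torch.Tensor:
    """flow_frames: (T, 2, H, W) predicted flows -> (num_target_frames, 2, H, W)."""
    # Treat time as the depth axis and resample over (T, H, W) trilinearly.
    x = flow_frames.permute(1, 0, 2, 3).unsqueeze(0)         # (1, 2, T, H, W)
    x = F.interpolate(x, size=(num_target_frames, *flow_frames.shape[-2:]),
                      mode="trilinear", align_corners=True)
    return x.squeeze(0).permute(1, 0, 2, 3)                  # (T', 2, H, W)
```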

On the motion-heavy Something-Something V2 benchmark, the combined FOFPred and Go with the Flow pipeline improves over the CogVideoX baseline under identical conditions, reaching SSIM 68.4, PSNR 22.26, LPIPS 28.5, FVD 75.39, KVD 11.38, and motion fidelity 0.662, all consistently better than CogVideoX. Importantly, FOFPred needs only language and a single frame at inference, while several controllable video baselines require hand or object masks or trajectories as extra inputs.


Key Takeaways

  1. FOFPred reframes motion prediction as language-driven future optical flow, predicting 4 dense optical flow frames from one or more current images and a text instruction, which provides a compact, motion-only representation for downstream tasks.
  2. The model uses a unified VLM-diffusion backbone, with Qwen2.5-VL as a frozen vision language encoder, the Flux.1 VAE as a frozen latent encoder for images and flow, and an OmniGen-style DiT as the only trained component, using spatio-temporal RoPE-based attention.
  3. Training relies on large-scale web and egocentric video from Something-Something V2 and EgoDex, and builds relative optical flow targets by estimating ego-motion via homography, subtracting camera flow, and filtering for high-motion segments, which significantly improves downstream performance.
  4. In robot manipulation, FOFPred acts as a motion backbone for a diffusion policy head and achieves state-of-the-art or better results on CALVIN ABCD and RoboTwin 2.0, including a 4.48 average task chain length on CALVIN and 68.6 percent average success on RoboTwin, outperforming VPP and DreamVLA variants.
  5. For text-to-video generation, connecting FOFPred to Go with the Flow yields better SSv2 metrics than CogVideoX, with higher SSIM and PSNR, lower FVD and KVD, and improved motion fidelity, while requiring only language and a single frame at inference, making FOFPred a reusable motion controller for both robotics and video synthesis pipelines.

Check out the Paper, Model and Repo.

