[Feature] Add FluxViT Model - Towards Deployment-Efficient Video Models #2902

### What is the problem this feature will solve?

Popular video training methods currently operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid. Because video is inherently redundant, this yields suboptimal accuracy-computation trade-offs. These models also cannot adapt to the varying computational budgets of downstream tasks, which hinders the use of competitive models in real-world scenarios with limited resources.

### What is the feature?

This request is to add the FluxViT model described in the paper "Make Your Training Flexible: Towards Deployment-Efficient Video Models". The paper introduces a new test setting, "Token Optimization", which maximizes input information across different computational budgets by optimizing the set of input tokens through token selection from more suitably sampled videos. To support this setting, it proposes an augmentation tool called "Flux", which makes the sampling grid flexible and leverages token selection. According to the paper, integrating Flux into video training frameworks boosts model robustness at minimal additional cost. The paper reports that FluxViT achieves state-of-the-art results across various video understanding tasks at standard cost, and that it matches the performance of previous state-of-the-art models at significantly reduced computational cost (e.g., using only 1/4 of the tokens).
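To make the idea concrete, here is a minimal NumPy sketch of sampling a clip on a flexible grid and then keeping only as many tokens as a budget allows. It is an illustration only, not the paper's implementation: the function names (`sample_clip`, `patchify`, `select_tokens`) are hypothetical, and the selection score (mean absolute difference to the same patch in the previous frame) is an assumed stand-in for whatever selector FluxViT actually uses.

```python
import numpy as np


def sample_clip(video, num_frames, size):
    """Sample a (T, H, W) video on a flexible grid: any frame count, any resolution.

    Uses uniform nearest-neighbour indexing along each axis (an assumption).
    """
    t, h, w = video.shape
    ti = np.linspace(0, t - 1, num_frames).round().astype(int)
    yi = np.linspace(0, h - 1, size).round().astype(int)
    xi = np.linspace(0, w - 1, size).round().astype(int)
    return video[ti][:, yi][:, :, xi]


def patchify(clip, patch):
    """Split a (T, H, W) clip into tokens of shape (T, N, patch * patch)."""
    t, h, w = clip.shape
    gh, gw = h // patch, w // patch
    x = clip[:, :gh * patch, :gw * patch]
    x = x.reshape(t, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    return x.reshape(t, gh * gw, patch * patch)


def select_tokens(clip, budget, patch=16):
    """Keep at most `budget` tokens, preferring patches that change over time.

    The score is a hypothetical heuristic, not the paper's selector.
    Requires at least two frames.
    """
    tokens = patchify(clip, patch)                        # (T, N, D)
    motion = np.abs(np.diff(tokens, axis=0)).mean(-1)     # (T - 1, N)
    # The first frame has no predecessor, so it reuses the first difference.
    scores = np.concatenate([motion[:1], motion], axis=0)  # (T, N)
    flat = tokens.reshape(-1, tokens.shape[-1])
    budget = min(budget, flat.shape[0])
    top = np.argsort(-scores.ravel(), kind="stable")[:budget]
    keep = np.sort(top)                                   # restore spatiotemporal order
    return flat[keep], keep


# Usage: 8 frames at 32x32 give 32 tokens; a budget of 8 keeps 1/4 of them.
raw = np.random.default_rng(0).random((32, 64, 64))
clip = sample_clip(raw, num_frames=8, size=32)
kept, indices = select_tokens(clip, budget=8, patch=16)
print(kept.shape)  # (8, 256)
```

Changing `num_frames`, `size`, and `budget` independently is what lets the same model be evaluated under different computational budgets in this sketch.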

### What alternatives have you considered?

The paper discusses two alternatives: token reduction on densely sampled tokens, and existing flexible network training methods that operate at different spatial or temporal resolutions. It argues that these approaches are suboptimal because they either suffer performance degradation at high reduction rates or fail to make full use of token capacity under computational constraints.
