Quantitative Evaluation and Domain Adaptation of Vision–Language Models for Mixed-Reality Interpretation of Indoor Environmental Computational Fluid Dynamics Visualizations

In built environmental design, incorporating building user participation and verifying indoor thermal performance at early design stages have become increasingly important. Although Computational Fluid Dynamics (CFD) analysis is widely used to predict indoor thermal environments, its results are difficult for non-expert stakeholders to interpret, even when visualized using Mixed Reality (MR). Interpreting CFD visualizations in MR requires quantitative reasoning that explicitly cross-references visual features with legend information, rather than relying on prior color–value associations learned from natural images. This study investigates the capability of Vision–Language Models (VLMs) to interpret MR visualizations of CFD results and respond to user queries. We focus on indoor temperature distributions and airflow velocities visualized in MR. A novel dataset was constructed, consisting of MR images with CFD results superimposed onto real indoor spaces, paired with domain-specific question–answer annotations requiring legend-based reasoning. Using this dataset, a general-purpose VLM (Qwen2.5-VL) was fine-tuned. Experimental results show that the baseline model achieved less than 30% accuracy, whereas fine-tuning improved accuracy to over 60% across all categories while largely preserving general reasoning performance. These results demonstrate that domain adaptation enables VLMs to quantitatively interpret physical information embedded in MR visualizations, supporting non-expert understanding in built environmental design.
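
For context, the sketch below illustrates how a Qwen2.5-VL checkpoint can be queried with an MR screenshot and a legend-based question of the kind described above, following the standard Hugging Face Transformers usage for this model family. It is a minimal illustration, not the authors' pipeline: the model ID refers to the public base checkpoint (the fine-tuned weights are not released here), and the image path and prompt are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): posing a legend-based question about an
# MR image with a CFD temperature overlay to Qwen2.5-VL via Transformers.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # public base checkpoint (assumption)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # Hypothetical MR screenshot showing the CFD result and its color legend.
        {"type": "image", "image": "mr_cfd_scene.png"},
        {"type": "text", "text": "Using the temperature legend, what is the "
                                  "approximate air temperature near the window?"},
    ],
}]

# Build the chat prompt and pack image + text into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate an answer and strip the prompt tokens before decoding.
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)],
    skip_special_tokens=True,
)[0]
print(answer)
```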
