Skin-TAIDE: Development of TAIDE Multimodal Models Using Retrieval-Augmented Generation and Fine-Tuning Approaches for Generating Traditional Chinese Diagnostic Descriptions of Skin Lesions
Purpose: With the growing interest in multimodal large language models (MLLMs) for medical image analysis, extending the unimodal TAIDE large language model to multimodal applications has become a significant research direction.

Methods: This study employed the SkinCAP multimodal dataset, which consists of 4,000 skin lesion images paired with textual descriptions. Two approaches to model training and evaluation were proposed: (1) a visual retrieval-augmented generation (RAG) method, which uses transfer learning for image feature extraction and cosine similarity for image retrieval; the retrieved results are used to construct prompts that guide the TAIDE model to produce diagnostic descriptions in Traditional Chinese; and (2) a fine-tuning method, which integrates the MiniGPT-V2 framework with the TAIDE model to build a multimodal system capable of automatically generating diagnostic descriptions.

Results: Model performance was evaluated using the BLEU, ROUGE-L, METEOR, CIDEr, and SPICE metrics. The results show that the fine-tuning approach, which integrates MiniGPT-V2 with the TAIDE model, outperforms the visual RAG method, which combines transfer-learning-based retrieval with the TAIDE model for description generation.

Conclusion: This study presents an empirical comparison of two methodologies for extending a unimodal large language model into multimodal applications for the automatic generation of diagnostic descriptions of skin lesions. The findings offer technical insights and a reference point for the development of future AI-based medical systems.
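To make the visual RAG step in Methods concrete, the following is a minimal sketch rather than the paper's implementation: it assumes an ImageNet-pretrained ResNet-50 as the transfer-learning feature extractor and a plain cosine-similarity search over precomputed SkinCAP image embeddings. The backbone choice, the helper names (embed, retrieve), and the prompt template are illustrative assumptions, as the abstract does not specify them.

```python
# Sketch of the visual RAG retrieval step: transfer-learning features +
# cosine-similarity lookup of reference descriptions for prompt construction.
# ResNet-50 is an assumed backbone; the paper does not name its extractor.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained backbone with the classification head removed -> 2048-d features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> np.ndarray:
    """Extract an L2-normalized feature vector for one skin-lesion image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = backbone(x).squeeze(0).numpy()
    return v / np.linalg.norm(v)

def retrieve(query_vec, index_vecs, descriptions, k=3):
    """Return the k reference descriptions whose images are most similar.
    On normalized vectors, cosine similarity reduces to a dot product."""
    sims = index_vecs @ query_vec          # index_vecs: (N, 2048) matrix
    top = np.argsort(-sims)[:k]
    return [descriptions[i] for i in top]

# The retrieved descriptions would then be folded into a prompt that asks
# TAIDE to write a Traditional Chinese diagnostic description of the query
# image, e.g. "Referring to the following similar cases, ..." + refs.
```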
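Likewise, a hedged sketch of the evaluation in Results, covering two of the five reported metrics. The library choices (nltk for BLEU, rouge-score for ROUGE-L) and the example strings are assumptions; the abstract does not name the evaluation implementations, and Traditional Chinese output would need word segmentation before scoring.

```python
# Illustrative scoring of one generated description against its reference,
# using assumed libraries (nltk, rouge-score); example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "erythematous scaly plaque on the extensor forearm"   # ground truth
candidate = "scaly erythematous plaque on the forearm"            # model output

# BLEU over whitespace tokens; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# ROUGE-L scores the longest common subsequence between the two texts.
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(
    reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L: {rougeL:.3f}")
```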