Facial Expression Recognition in Anime and Manga Characters: A Comparative Study of Vision Transformers and Convolutional Neural Networks

Facial expression recognition (FER) is a well-established task in computer vision, yet its application to non-photorealistic domains, such as anime and manga, remains largely underexplored. The stylized, exaggerated, and often non-proportional facial features of illustrated characters present unique challenges for deep learning models trained predominantly on realistic imagery. In this work, we construct a balanced dataset of 3,000 manga and anime face images spanning six emotion categories (Angry, Embarrassed, Happy, Psycho-Crazy, Sad, Scared) and conduct a systematic comparison of two major deep learning paradigms: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Specifically, We evaluate ResNet-18, ResNet-50, ViT-B/16, and ViT-S/16 under four fine-tuning strategies: linear probing, partial fine-tuning, full fine-tuning, and progressive unfreezing; enabling a controlled comparison of both architectural families and transfer learning depth. Our results show that fine-tuning strategy significantly impacts performance: the best configuration (ViT-B/16 with progressive unfreezing) achieves 80.89% test accuracy, compared to 61.33% for the weakest linear probe baseline (ViT-S/16), a gap of 19.56 percentage points. Vision Transformers benefit disproportionately from fine-tuning, and the relative ranking of architectures changes across fine-tuning regimes. Confusion matrix analysis reveals persistent cross-class confusion between visually similar emotions (e.g., Happyvs. Embarrassed), while highly distinctive categories such as Psycho-Crazy are consistently well recognized across all architectures.

Liked Liked