Characters are one of the most important elements in composing digital animation. A character's appearance and voice should be designed to express its personality and values, but it is not easy for animation producers to match a character's appearance and voice harmoniously. Advances in deep learning technology have made it possible to overcome this limitation. To achieve this, an audio-visual dataset of characters is first required. In this study, we construct and verify a Korean audio-visual dataset consisting of frontal face images and short voice clips of various characters. We developed an application that automatically extracts a character's frontal face image and a short voice clip from videos uploaded to YouTube. Using this application, we built a dataset of 1,522 face images and 7,999 seconds of voice clips covering 490 characters. Furthermore, we automatically label the characters by gender and age to validate the dataset. The dataset built in this study is expected to be used in various deep learning fields, such as classification, generative adversarial networks, and speech synthesis.
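For a rough sense of scale, the reported totals imply the following per-character averages. This is a simple illustrative computation from the numbers stated above, not a statistic reported by the study itself:

```python
# Dataset totals as reported in the abstract
num_characters = 490
num_face_images = 1522
total_voice_seconds = 7999

# Average coverage per character (illustrative only)
images_per_character = num_face_images / num_characters
seconds_per_character = total_voice_seconds / num_characters

print(f"~{images_per_character:.1f} face images per character")
print(f"~{seconds_per_character:.1f} seconds of voice per character")
```

On average, each character is thus represented by roughly 3 face images and about 16 seconds of audio.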