This research explores methods for integrating deep learning models with large language models in speech recognition, with the goal of improving recognition accuracy and robustness in complex contexts. A deep neural network (DNN), a convolutional neural network (CNN), a long short-term memory (LSTM) network, and a Transformer-based large language model are combined into an integrated acoustic and language modeling framework. Experiments on the TIMIT, LibriSpeech, and Common Voice datasets show that the integrated model achieves significant improvements in both word error rate (WER) and real-time factor (RTF) over traditional models, and it performs especially well in adapting to multiple languages and accent variation. These results indicate that combining acoustic and language modeling techniques can effectively improve the performance of speech recognition systems in complex environments, suggesting a new direction for the future development of speech recognition technology.
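To make the evaluation and integration concrete, the sketch below shows two standard building blocks consistent with the setup described above: word error rate computed as word-level edit distance, and a simple log-linear (shallow-fusion) rescoring of acoustic-model hypotheses with a language-model score. The abstract does not specify the exact fusion method, so `rescore_nbest`, `lm_score`, and `lm_weight` are illustrative assumptions, not the paper's implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions +
    insertions + deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)


def rescore_nbest(nbest, lm_score, lm_weight=0.5):
    """Shallow-fusion rescoring (illustrative): combine each hypothesis's
    acoustic log-probability with a weighted language-model log-probability
    and return the highest-scoring transcript."""
    # nbest is a list of (transcript, acoustic_log_prob) pairs;
    # lm_weight is a hypothetical interpolation weight.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))
```

For example, a language model that favors a fluent transcript can overturn the acoustic model's top choice: with hypotheses `[("the cat sat", -4.0), ("a cat sat", -3.5)]` and an LM that scores the first higher, rescoring selects `"the cat sat"` despite its lower acoustic score.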