Domain generalization has achieved remarkable success in Hyperspectral Image (HSI) classification. Inspired by Contrastive Language-Image Pre-training (CLIP), language-aware domain generalization methods have been explored to learn cross-domain-invariant representations. However, existing methods face two challenges: 1) a weak capacity to extract long-range contextual information and inter-class correlations; 2) because large-scale pre-training is inadequate for HSI data, the spatial-spectral features of HSI cannot be straightforwardly aligned with linguistic features. To address these problems, a novel network is proposed within a CLIP framework. It consists of an image encoder, built on an encoder-only transformer to capture global contextual information and inter-class correlations; a frozen text encoder; and a cross-attention mechanism, named the Linguistic-Interact-with-Visual Engager (LIVE), which enhances the interaction between the two modalities. Extensive experiments demonstrate superior performance over state-of-the-art methods in HSI domain generalization with a CLIP framework.
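The abstract does not specify the internals of LIVE, so the following is only a minimal, generic sketch of scaled dot-product cross-attention between visual and linguistic features; the function name, the identity projections, and the feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(visual, text):
    """Visual tokens (queries) attend over text tokens (keys/values).

    visual: (n_v, d) spatial-spectral features  -- hypothetical shapes
    text:   (n_t, d) frozen text-encoder features
    Returns language-conditioned visual features of shape (n_v, d).
    (Learned query/key/value projections are omitted for brevity.)
    """
    d = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d)   # (n_v, n_t) similarity
    attn = softmax(scores, axis=-1)         # each row sums to 1
    return attn @ text                      # (n_v, d)

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 8))   # 4 visual tokens, dim 8
t = rng.standard_normal((3, 8))   # 3 text tokens, dim 8
out = cross_attention(v, t)
print(out.shape)  # (4, 8)
```

In this reading, cross-attention lets each visual token aggregate the most relevant linguistic features, which is one common way to couple an image branch with a frozen text encoder when direct feature alignment is unreliable.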