Deep convolutional neural networks (CNNs) have been widely used in various medical imaging tasks. However, due to the intrinsic locality of the convolution operation, CNNs generally cannot model long-range dependencies well, which are important for accurately identifying breast cancer from unregistered multi-view mammograms. This motivates us to leverage the Vision Transformer architecture to capture long-range relationships among multiple mammograms from the same patient. For this purpose, we employ local Transformer blocks to separately learn patch relationships within a specific mammographic view (CC/MLO) of one side (right/left). The outputs from the different views and sides are concatenated and fed into global Transformer blocks to jointly learn patch relationships across the two views of the left and right breasts. To evaluate the proposed model, we retrospectively assembled a dataset of 949 sets of CC- and MLO-view mammograms, including 470 malignant cases and 479 normal or benign cases. We trained and evaluated the model using five-fold cross-validation. Without any arduous preprocessing steps (e.g., optimal window cropping, chest wall or pectoral muscle removal, two-view image registration), our two-view Transformer-based model achieves a lesion classification accuracy of 77.0% and an area under the ROC curve of 0.814, outperforming state-of-the-art multi-view CNNs by 3.1% and 3%, respectively. Meanwhile, the new two-view model improves the mammographic case classification accuracies of two single-view models by 7.4% (CC) and 4.5% (MLO), respectively. These promising results highlight the great potential of using Transformers to develop high-performing computer-aided diagnosis (CADx) schemes for mammography.
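The local-then-global attention flow described above can be sketched at the tensor level. The snippet below is a minimal, hypothetical illustration (not the authors' implementation): it uses a single-head scaled dot-product self-attention with random weights in place of learned Transformer blocks, and assumed values for the patch count and embedding dimension, to show how per-view local attention outputs are concatenated before global attention relates patches across all four view/side images.

```python
import numpy as np

def self_attention(x, rng):
    """Single-head scaled dot-product self-attention.
    Random projection matrices stand in for learned parameters."""
    d = x.shape[-1]
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                    # (n, n) patch affinities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ v

rng = np.random.default_rng(0)
n_patches, dim = 16, 64  # assumed sizes, for illustration only

# One patch-token sequence per unregistered mammogram: both views, both sides.
views = {name: rng.standard_normal((n_patches, dim))
         for name in ("R-CC", "L-CC", "R-MLO", "L-MLO")}

# Local blocks: attention restricted to patches within a single view/side.
local_out = [self_attention(tokens, rng) for tokens in views.values()]

# Global blocks: concatenating the four outputs lets attention relate
# patches across views and sides in one joint sequence.
joint = np.concatenate(local_out, axis=0)            # (4 * n_patches, dim)
global_out = self_attention(joint, rng)

print(joint.shape, global_out.shape)
```

A real model would stack several such blocks with multi-head attention, layer normalization, and feed-forward sublayers, and pool the global output into a malignant-vs-benign classification head; the sketch only captures the two-stage local/global token flow.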