Submitted: 18 April 2025
Posted: 23 April 2025
Abstract
Keywords:
1. Introduction
2. Theoretical Foundations: Kolmogorov–Arnold Representation Theorem
2.1. Historical Context and Motivation
2.2. Formal Statement of the Theorem
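In its commonly cited modern form, the theorem states that every continuous function f: [0,1]^n → R can be written as a finite superposition of continuous univariate functions and addition:

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),
```

where the inner functions φ_{q,p}: [0,1] → R and the outer functions Φ_q: R → R are continuous and univariate. Several properties of this representation are worth emphasizing: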
- It demonstrates the expressive power of compositions of univariate functions.
- It provides a theoretical upper bound on the number of functional components needed to approximate any continuous function.
- It holds uniformly over the entire domain, assuming f is continuous [20].
2.3. Comparison with the Universal Approximation Theorem
- KART-based architectures may exhibit improved interpretability due to the modularity of univariate functions.
- Unlike UAT-based networks, which typically rely on large hidden dimensions and dense parameterization, KART-inspired architectures aim for a more sparse and structured decomposition [22].
- KART provides a direct path for neural networks to emulate explicit function decomposition, useful in tasks such as symbolic regression and scientific computing.
2.4. Challenges and Limitations of the Theorem
- The proof is non-constructive: while existence is guaranteed, the inner functions φ_{q,p} and outer functions Φ_q are not analytically defined or easily derivable [23].
- The functions may not be smooth or differentiable, limiting direct applicability in gradient-based optimization [24].
- The original form assumes continuity but does not extend easily to broader function classes (e.g., piecewise continuous or stochastic functions) [25].
- The inner functions and constants in the representation are fixed and problem-independent, which may limit adaptability in practical implementations.
2.5. Implications for Machine Learning
3. Kolmogorov–Arnold Network Architectures
3.1. Architectural Design Principles
3.2. Parametrization of Univariate Functions
- Piecewise Linear Functions (Spline-based): Functions are parameterized as linear interpolants over a fixed or learnable set of knots. This enables expressive yet smooth transformations with efficient gradient computation.
- Neural Subnetworks: Each univariate function is modeled by a small neural network, typically a multilayer perceptron with one hidden layer [33]. This introduces additional depth and nonlinearity at the cost of interpretability.
- Fourier or Wavelet Bases: Functions are expressed as sums of sinusoids or wavelets, which can be particularly effective for periodic or localized features.
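As a concrete illustration of the spline-based option above, the following minimal sketch (in PyTorch; the class name PiecewiseLinearUnit and its defaults are illustrative choices, not taken from any particular KAN library) parameterizes a univariate function by learnable values at fixed, uniformly spaced knots and evaluates it by linear interpolation:

```python
import torch
import torch.nn as nn

class PiecewiseLinearUnit(nn.Module):
    """A univariate function phi(x) parameterized by learnable values
    at fixed, uniformly spaced knots; evaluation is linear interpolation."""
    def __init__(self, num_knots: int = 16, x_min: float = -1.0, x_max: float = 1.0):
        super().__init__()
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        # Initialize as the identity map phi(x) = x (a common stabilizing choice).
        self.values = nn.Parameter(self.knots.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp inputs to the knot range, locate the surrounding knot interval,
        # and linearly interpolate between the two learnable knot values.
        x = x.clamp(self.knots[0], self.knots[-1])
        idx = torch.searchsorted(self.knots, x.detach(), right=True)
        idx = idx.clamp(1, len(self.knots) - 1)
        x0, x1 = self.knots[idx - 1], self.knots[idx]
        y0, y1 = self.values[idx - 1], self.values[idx]
        t = (x - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)
```

A call such as `phi = PiecewiseLinearUnit(); y = phi(torch.randn(32))` applies one learned univariate transformation to a batch of scalar inputs.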
3.3. Connectivity and Topology
- Dense (Fully Connected) KANs: Every unit in layer j receives inputs from all units in layer j-1 via univariate transformations. This is the most expressive and general form [35].
- Locally Connected KANs: Inspired by CNNs, connectivity is restricted to spatially or semantically adjacent units, allowing for parameter sharing and local inductive biases [36].
- Sparse KANs: Connections are limited by a predefined or learnable sparsity pattern to reduce complexity and improve generalization.
- Hierarchical or Modular KANs: Layers are organized into modules, each responsible for a subset of input features, suitable for high-dimensional structured data [37].
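To make the dense (fully connected) variant concrete, the sketch below (continuing the hypothetical PiecewiseLinearUnit from the previous example; an illustrative simplification, not the reference implementation of [35]) assigns one learnable univariate function to every (input, output) edge and aggregates by summation:

```python
import torch
import torch.nn as nn
# Continues the PiecewiseLinearUnit sketch shown earlier.

class DenseKANLayer(nn.Module):
    """out_j = sum_i phi_{j,i}(x_i): each edge (i, j) carries its own
    learnable univariate function, and outputs are formed by summation."""
    def __init__(self, in_features: int, out_features: int, num_knots: int = 16):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.edge_functions = nn.ModuleList(
            [PiecewiseLinearUnit(num_knots) for _ in range(in_features * out_features)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_features)
        outputs = []
        for j in range(self.out_features):
            # Apply the j-th row of edge functions to each input coordinate and sum.
            terms = [self.edge_functions[j * self.in_features + i](x[:, i])
                     for i in range(self.in_features)]
            outputs.append(torch.stack(terms, dim=0).sum(dim=0))
        return torch.stack(outputs, dim=1)  # (batch, out_features)
```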
3.4. Regularization and Inductive Biases
- Smoothness Constraints: Penalizing the derivatives (e.g., via regularization on spline coefficients) enforces smooth univariate functions and prevents sharp transitions [39].
- Weight Decay on Functional Parameters: Analogous to standard networks, L1 or L2 penalties can be applied to the parameters governing the function representation.
- Sparsity-Inducing Penalties: Encouraging sparsity in connectivity or function usage leads to more interpretable and generalizable models.
- Monotonicity or Convexity Constraints: In certain applications (e.g., economics, physics), enforcing domain-specific structural constraints improves performance and alignment with known laws [40].
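As an illustration of the smoothness and sparsity penalties above, the sketch below (assuming the piecewise-linear parameterization from the earlier examples, where each function's `values` tensor plays the role of spline coefficients; the helper name `kan_regularizer` is hypothetical) adds a discrete curvature penalty and an L1 term to the training loss:

```python
import torch

def kan_regularizer(layer: "DenseKANLayer",
                    smooth_weight: float = 1e-3,
                    sparse_weight: float = 1e-4) -> torch.Tensor:
    """Curvature (second-difference) penalty encourages smooth univariate
    functions; an L1 penalty on coefficients encourages sparse, near-zero edges."""
    penalty = 0.0
    for phi in layer.edge_functions:
        v = phi.values
        second_diff = v[2:] - 2.0 * v[1:-1] + v[:-2]   # discrete curvature
        penalty = penalty + smooth_weight * second_diff.pow(2).sum()
        penalty = penalty + sparse_weight * v.abs().mean()
    return penalty
```

In training, the penalty would simply be added to the task loss, e.g. `loss = task_loss + kan_regularizer(layer)`.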
3.5. Training and Optimization
- Initialization: Initializing univariate functions as identity maps (i.e., φ(x) = x) can stabilize training by mimicking linear operations at the start.
- Gradient Flow: Since KANs avoid repeated affine transformations, gradient vanishing/explosion may be less severe, but care must still be taken with depth and function scaling.
- Batch Normalization and Residuals: These techniques can be integrated to improve convergence, though they must be adapted to the function-centric computation model [42].
- Adaptive Function Resolution: Increasing the resolution of spline bases or number of expansion terms during training allows for progressive refinement [43].
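Identity initialization is already shown in the constructor of the earlier PiecewiseLinearUnit sketch; adaptive function resolution can be sketched under the same assumptions as re-sampling a learned function onto a denser knot grid (the helper `refine_resolution` is hypothetical):

```python
import torch

@torch.no_grad()
def refine_resolution(phi: "PiecewiseLinearUnit", new_num_knots: int) -> "PiecewiseLinearUnit":
    """Create a higher-resolution copy of phi by evaluating the current
    function on a denser knot grid and using those samples as new parameters."""
    finer = PiecewiseLinearUnit(new_num_knots,
                                x_min=float(phi.knots[0]),
                                x_max=float(phi.knots[-1]))
    finer.values.copy_(phi(finer.knots))   # re-sample the learned function
    return finer
```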
3.6. Architectural Variants and Extensions
- KANs with Multiplicative Interactions: Beyond additive aggregation, multiplicative and gated mechanisms can be introduced to model higher-order dependencies [44].
- KAN-Attention Hybrids: Integrating attention mechanisms with KANs enables dynamic routing and context-aware univariate transformations [45].
- KAN-Transformer Architectures: In sequential and structured domains, KAN layers have been proposed as replacements for MLP blocks in transformers.
- KANs for Graphs and Sets: Permutation-invariant KANs have been developed by applying shared univariate transformations and aggregating over node features [46].
3.7. Interpretability Advantages
4. Empirical Performance and Benchmarks
4.1. Function Approximation on Synthetic Datasets
- KANs achieve lower approximation error on benchmark functions such as the high-dimensional Sine, Ackley, and Rosenbrock functions, compared to standard MLPs and even specialized architectures like Fourier neural operators (FNOs).
- Due to their compositional inductive bias, KANs require fewer trainable parameters to reach the same approximation accuracy as deep ReLU networks [54].
- Smoothness priors imposed on univariate functions enhance generalization to unseen input regions, outperforming MLPs prone to overfitting [55].
- KANs exhibit improved extrapolation beyond the training domain, likely due to their modular and interpretable structure.
4.2. Symbolic Regression and Interpretable Modeling
- KANs can recover known analytic forms when trained on sampled data, as the learned univariate functions closely match the underlying components.
- By analyzing the individual components, one can extract symbolic approximations via fitting splines or polynomials to the learned transformations.
- Compared to symbolic regression algorithms like Eureqa or genetic programming, KANs offer faster convergence and higher accuracy, albeit at the cost of a more "black-box" representation unless post-processed.
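A simple post-processing step of the kind described above, fitting an analytic form to a learned univariate transformation, can be sketched with NumPy; `phi` stands for any trained univariate function (e.g., the piecewise-linear unit sketched earlier), and `fit_polynomial` is an illustrative helper, not a published extraction procedure:

```python
import numpy as np
import torch

def fit_polynomial(phi, degree: int = 3, x_min: float = -1.0, x_max: float = 1.0):
    """Sample the learned univariate function on a grid and fit a low-degree
    polynomial to it; small residuals suggest a simple symbolic form."""
    xs = torch.linspace(x_min, x_max, 200)
    with torch.no_grad():
        ys = phi(xs).numpy()
    coeffs = np.polyfit(xs.numpy(), ys, deg=degree)
    residual = np.abs(np.polyval(coeffs, xs.numpy()) - ys).max()
    return coeffs, residual
```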
4.3. Scientific Machine Learning and PDE Solving
- The univariate composition architecture aligns naturally with separation of variables techniques common in analytical solutions [61].
- KANs outperform standard PINNs (physics-informed neural networks) in stability and convergence when modeling high-frequency or multi-scale phenomena.
- The explicit structure of KANs allows for domain-specific inductive biases, such as symmetry or monotonicity, to be easily enforced.
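As a sketch of how a KAN (or any differentiable surrogate) enters a physics-informed loss of this kind, consider the 1D Poisson problem u''(x) = f(x); the residual below is a generic autograd construction, not the specific formulation of [61] or of any particular KAN-PINN paper:

```python
import torch

def poisson_residual_loss(model, x: torch.Tensor, source) -> torch.Tensor:
    """Physics-informed residual for u''(x) = f(x): differentiate the model
    output twice w.r.t. its input and penalize the squared PDE residual."""
    x = x.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - source(x)) ** 2).mean()
```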
4.4. Image and Signal Processing Tasks
- On MNIST and Fashion-MNIST, KANs attain classification accuracy comparable to or slightly exceeding shallow CNNs, with fewer parameters [64].
- The univariate transformations can learn useful edge detectors and nonlinear intensity mappings, which are visualizable and interpretable [65].
- On more complex datasets like CIFAR-10, performance lags behind deep CNNs and transformers, though hybrid KAN-CNN architectures can partially bridge the gap [66].
- KANs can also be applied to 1D signal data (e.g., ECG, audio), where their modularity offers robustness to temporal distortions.
4.5. Generalization and Robustness
- KANs maintain stable performance with significantly fewer training samples, unlike deep MLPs that require large datasets to avoid overfitting.
- Under domain shift (e.g., training and testing on different input ranges), KANs exhibit smaller degradation in prediction accuracy [67].
- Adversarial robustness is improved in settings where perturbations follow smooth transformations, though KANs remain vulnerable to high-frequency adversarial noise unless regularized.
4.6. Computational Efficiency and Scalability
- Training time per epoch is generally higher than MLPs due to spline evaluation or complex basis functions, but convergence is often achieved in fewer epochs.
- GPU acceleration of univariate function evaluation remains an area of active optimization, with custom CUDA kernels or spline libraries under development.
- Memory usage is lower for shallow KANs, though large KANs with many high-resolution univariate functions can have increased footprint [70].
4.7. Ablation Studies and Sensitivity Analysis
- Removing univariate parameterization (i.e., replacing each learned function with a linear map) degrades performance significantly, confirming the necessity of learned nonlinear transformations.
- Imposing weight sharing among functions (i.e., identical transformations across inputs) improves regularization but reduces expressivity.
- Enforcing monotonicity or other priors via constrained optimization can help align learned models with known physical or semantic properties [71].
5. Interpretability and Analytical Insights
5.1. Functional Modularity and Decomposability
- Atomic functional elements: Each univariate transformation can be analyzed independently, providing localized understanding of how individual inputs contribute to the output.
- Additive transparency: The output of a KAN layer is an additive combination of interpretable components, making the contribution of each input dimension explicitly traceable [77].
- Compositional semantics: The composition of such functions across layers retains structure that can often be mapped to known functional forms (e.g., polynomial, logarithmic, trigonometric) [78].
5.2. Visualization of Learned Functions
- Function plots: Graphs of φ(x) versus x reveal nonlinearities, inflection points, discontinuities, and saturation behaviors [79].
- Sensitivity curves: The derivative dφ/dx indicates how sensitive the output is to changes in a particular input, analogous to feature importance [80].
- Activation heatmaps: For networks with many univariate functions (e.g., in deeper layers), heatmaps can reveal global activation patterns across data batches.
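A minimal visualization routine along these lines (matplotlib; the derivative is approximated numerically, and `phi` is any learned univariate function such as the piecewise-linear unit sketched earlier) might look as follows:

```python
import torch
import matplotlib.pyplot as plt

def plot_univariate(phi, x_min: float = -1.0, x_max: float = 1.0):
    """Plot the learned function phi(x) and its numerical derivative,
    which serves as a per-input sensitivity curve."""
    xs = torch.linspace(x_min, x_max, 400)
    with torch.no_grad():
        ys = phi(xs)
    dydx = torch.gradient(ys, spacing=(xs,))[0]   # finite-difference derivative
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.plot(xs, ys); ax1.set_title("phi(x)")
    ax2.plot(xs, dydx); ax2.set_title("d phi / dx (sensitivity)")
    plt.tight_layout(); plt.show()
```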
5.3. Alignment with Domain Knowledge
- Monotonicity: Enforcing or verifying that a learned φ is monotonic aligns with physical laws or economic constraints [82].
- Symmetry and periodicity: Learned functions that exhibit symmetry (e.g., φ(x) = φ(-x)) or periodicity (e.g., sinusoidal behavior) can be validated against expected behavior.
- Dimensional reduction: Inputs whose associated functions converge to near-constant forms (φ(x) ≈ const) can be deemed irrelevant, aiding in feature selection.
5.4. Symbolic Interpretation and Extraction
- Fitting analytic expressions: Post hoc fitting of the learned univariate functions to known function classes (e.g., trigonometric, exponential) using symbolic regression [84].
- Basis projection: Projecting learned functions onto a fixed functional basis (e.g., orthogonal polynomials) and extracting dominant terms [85].
- Spline simplification: Approximating piecewise spline representations with simplified piecewise-linear or piecewise-analytic forms.
- Pruning and merging: Removing redundant or overlapping functional components to yield a compact, symbolic approximation.
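The basis-projection idea can be sketched with NumPy's Chebyshev utilities: sample the learned function, fit a truncated Chebyshev expansion, and keep only the dominant coefficients (an illustrative recipe; the helper name and threshold are arbitrary):

```python
import numpy as np
import torch
from numpy.polynomial import chebyshev as C

def chebyshev_terms(phi, degree: int = 8, threshold: float = 1e-2):
    """Project a learned univariate function onto Chebyshev polynomials on [-1, 1]
    and report the indices and values of coefficients above a magnitude threshold."""
    xs = torch.linspace(-1.0, 1.0, 256)
    with torch.no_grad():
        ys = phi(xs).numpy()
    coeffs = C.chebfit(xs.numpy(), ys, deg=degree)
    dominant = [(k, c) for k, c in enumerate(coeffs) if abs(c) > threshold]
    return dominant
```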
5.5. Contrast with MLP Interpretability Techniques
- Gradient-based attribution (e.g., saliency maps, integrated gradients).
- Activation maximization and feature visualization [86].
- Layer-wise relevance propagation or SHAP values.
5.6. Cognitive and Neuroscientific Parallels
- Compositionality: The human brain is believed to understand complex concepts by composing simpler functions or primitives—a principle mirrored by KANs [89].
- Tuning curves: Neurons in sensory systems often respond to univariate stimuli (e.g., orientation, frequency) in smooth, non-linear ways, akin to learned functions [90].
- Dimensional disentanglement: KANs promote a representation where each dimension is transformed independently, which aligns with the cognitive notion of factorized representations.
5.7. Limitations and Open Challenges in Interpretability
- In deep KANs, compositional interactions across layers may obscure simple univariate explanations [92].
- Highly nonlinear functions can be difficult to interpret, especially when extrapolating beyond the training domain [93].
- For high-dimensional inputs, the sheer number of components can overwhelm human analysis unless automated summarization is applied [94].
- Current techniques for symbolic extraction remain heuristic and may not capture all nuances of learned behavior [95].
6. Theoretical Foundations and Approximation Guarantees
6.1. Kolmogorov’s Superposition Theorem
- Compositional sufficiency: Multivariate functions can be built from summations and compositions of univariate transformations.
- Dimensional reduction: No more than (2n+1)(n+1) univariate functions (2n+1 outer and n(2n+1) inner) are needed to represent any n-dimensional continuous function [100].
6.2. Arnold’s Refinement and Smoothness Constraints
- Smooth Superposition: If f is smooth, then the representation can be constructed such that the composing functions are also smooth to order k (subject to constraints) [101].
- Local linearity: The inner mappings can often be taken to be locally affine, preserving interpretability and easing implementation in network form.
6.3. Approximation Theory and Universality of KANs
- KANs can approximate target functions in standard norms (e.g., the sup norm or L^p norms) under mild regularity assumptions.
- The convergence rate depends on the smoothness of the basis used to represent functions (e.g., B-splines, Gaussian RBFs).
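One standard form of such a rate, stated here as an illustrative bound for spline-parameterized univariate functions under the assumption that the target and the representing functions are sufficiently smooth (rather than as a result specific to any one KAN paper), is

```latex
\| f - \mathrm{KAN}_G(f) \|_{\infty} \;\le\; C \, G^{-(k+1)},
```

where G is the number of grid intervals per univariate spline, k is the spline order, and the constant C depends on f but not directly on the input dimension.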
6.4. Comparison with ReLU Networks
- Curse of dimensionality: Approximation accuracy deteriorates rapidly as input dimensionality increases [105].
- Non-smoothness: ReLU networks approximate smooth functions using non-smooth components, leading to artifacts and instability.
- Hard-to-interpret geometry: The activation region boundaries in ReLU networks are hyperplanes that intersect in complex ways.
6.5. Sample Complexity and Generalization Bounds
- Lower VC-dimension: Due to their constrained architecture, KANs often have a lower Vapnik–Chervonenkis (VC) dimension than deep MLPs with similar capacity [106].
- Capacity control: The effective complexity of a KAN can be directly controlled via the number and smoothness of the univariate functions.
- Regularization potential: Incorporating penalties on the curvature, total variation, or higher-order derivatives of the univariate functions provides a mechanism to enforce smoothness priors and control overfitting [107].
6.6. Convergence Behavior and Optimization Landscape
- Gradient stability: Learning univariate functions in isolation tends to stabilize gradients and reduce pathologies like exploding or vanishing gradients.
- Fewer local minima: The separation of variables reduces interactions among parameters, leading to a potentially simpler loss landscape [109].
- Initialization flexibility: KANs can be initialized using priors or analytic approximations (e.g., identity functions, sinusoids), providing better inductive bias than random MLP weights [110].
6.7. Limitations and Open Theoretical Questions
- Optimal architecture: The minimal number of univariate functions required for accurate approximation in real-world settings remains unclear.
- Extension to stochastic functions: There is limited theory on how KANs approximate stochastic processes or probability densities [111].
- Scalability to high dimensions: While KST avoids the curse of dimensionality in theory, practical KANs must still address parameter growth and training efficiency in high-n regimes.
- Approximation lower bounds: Existing results are mostly upper bounds; understanding the limits of what KANs cannot efficiently represent is an ongoing research direction.
7. Practical Considerations and Engineering Challenges
7.1. Representation of Univariate Functions
7.1.1. Spline Interpolation
- Advantages: Control over smoothness via knot density; easy gradient computation; strong approximation guarantees.
- Challenges: Knot placement becomes critical; overparameterization may lead to oscillatory behavior (Runge’s phenomenon) [115].
- Training: Parameters typically correspond to spline control points or coefficients, trained via backpropagation [116].
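For concreteness, the snippet below evaluates a cubic B-spline from a set of coefficients with SciPy (a minimal sketch of the representation under discussion; in a trainable KAN the coefficients would be learnable parameters rather than random numbers):

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                   # cubic spline
interior = np.linspace(-1.0, 1.0, 8)    # knot sites covering the input range
t = np.concatenate(([interior[0]] * k, interior, [interior[-1]] * k))  # clamped knots
c = np.random.randn(len(t) - k - 1)     # spline coefficients (control points)

phi = BSpline(t, c, k)                  # univariate function phi(x)
x = np.linspace(-1.0, 1.0, 5)
print(phi(x))                           # evaluate phi at sample inputs
print(phi.derivative()(x))              # derivative of phi w.r.t. its input
```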
7.1.2. Neural Parameterizations
7.1.3. Fourier or Polynomial Basis
- Advantages: Compact representation; strong theoretical properties; often leads to symbolic interpretability.
- Challenges: Coefficient estimation may be ill-conditioned; basis choice may bias learning [121].
7.2. Initialization and Inductive Bias
- Identity initialization: Initializing each univariate function as the identity map (φ(x) = x) encourages initial linear behavior and provides a natural prior.
- Analytic priors: Initializing with known domain-specific functions (e.g., sinusoids) accelerates convergence.
- Random projections: Sampling from function spaces (e.g., Gaussian processes) introduces stochastic diversity.
7.3. Gradient Flow and Optimization
- Gradient locality: Since each function affects a limited subset of the output, gradient updates are localized, reducing cross-parameter interference [124].
- Stiffness: The learning dynamics of spline coefficients or basis expansions may exhibit stiffness, requiring adaptive optimizers (e.g., Adam, Ranger).
- Vanishing gradients: Shallow univariate transformations may saturate or flatten, necessitating activation-aware regularization.
7.4. Computational Overheads and Efficiency
- Function evaluation cost: Evaluating spline or Fourier expansions for each input dimension can be more expensive than a matrix multiplication [104].
- Memory usage: Storing high-resolution spline tables or basis coefficients increases memory consumption.
- Parallelism: Due to the inherently sequential nature of function composition, parallelization across batch and input dimensions is more limited than in matrix-based layers [125].
7.5. Integration into Deep Learning Frameworks
- Custom layers and modules: Developers must define spline/basis layers manually or use third-party libraries [127].
- Autograd compatibility: Gradient computation through custom univariate functions must be stable and differentiable.
- Tooling support: Visualization, logging, and checkpointing tools must accommodate non-standard layer structures.
7.6. Robustness, Regularization, and Generalization
- Smoothness penalties: Penalizing derivatives of the learned univariate functions controls overfitting and enforces functional simplicity.
- Structural dropout: Randomly deactivating input branches or basis components encourages redundancy and resilience.
- Domain alignment: Regularizing the learned functions to match known physical or statistical constraints (e.g., monotonicity, convexity) improves robustness [129].
7.7. Deployment and Inference-Time Efficiency
- Function caching: Precomputing and caching univariate function outputs reduces online evaluation time [130].
- Model distillation: Converting a trained KAN into a simpler surrogate (e.g., an MLP or symbolic formula) allows faster inference.
- Quantization: Univariate functions can be quantized or compressed using table lookups or low-rank approximations.
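Function caching as described above can be as simple as tabulating each learned univariate function once and replacing it by interpolation into the table at inference time (a sketch; `build_lookup_table` is a hypothetical helper, and the resolution/accuracy trade-off is application-dependent):

```python
import numpy as np
import torch

def build_lookup_table(phi, x_min: float = -1.0, x_max: float = 1.0, resolution: int = 1024):
    """Precompute phi on a dense grid; at inference, evaluate by table interpolation."""
    xs = torch.linspace(x_min, x_max, resolution)
    with torch.no_grad():
        table = phi(xs).numpy()
    grid = xs.numpy()

    def cached_phi(x: np.ndarray) -> np.ndarray:
        # np.interp clamps inputs outside the grid to the endpoint values.
        return np.interp(x, grid, table)

    return cached_phi
```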
7.8. Open Implementation Challenges
- Standardization: Lack of widely adopted KAN implementations hampers reproducibility and benchmarking [131].
- Visualization tools: Robust tools for visualizing learned univariate functions, composition hierarchies, and symbolic approximations are underdeveloped [132].
- Hyperparameter tuning: Choosing the number, type, and complexity of univariate functions remains more art than science.
- Scalability: Current implementations scale poorly to inputs with thousands of dimensions (e.g., genomics, text embeddings) [133].
8. Conclusion
- Expressive Power: KANs possess universal approximation capabilities grounded in rigorous mathematical theorems, with demonstrable advantages in approximating smooth, high-dimensional functions using a structured and interpretable composition of univariate functions.
- Interpretability and Sparsity: The separation of variable-wise nonlinearities allows for clearer attribution of model behavior to specific inputs, often yielding simpler and more interpretable representations than black-box neural networks.
- Empirical Performance: On a growing number of benchmarks in regression, scientific computing, and symbolic discovery, KANs have demonstrated competitive or superior performance with significantly fewer trainable parameters, especially in low-data regimes or under interpretability constraints.
- Engineering Trade-offs: Despite their promise, KANs introduce new complexities in function representation, optimization stability, and computational overhead, requiring careful architectural and algorithmic tuning to be competitive at scale.
- Ecosystem Integration: While integration with existing deep learning frameworks remains in early stages, emerging tools and libraries are beginning to support spline-based components, enabling wider adoption and experimentation.
References
- Ma, C. A unified framework for multiscale spectral generalized FEMs and low-rank approximations to multiscale PDEs. arXiv preprint, arXiv:2311.08761 2023.
- Shazeer, N. GLU Variants Improve Transformer. arXiv preprint, arXiv:2002.05202 2020.
- Arnold, V.I. On functions of three variables. Doklady Akademii Nauk SSSR 1957, 114, 679–681. [Google Scholar]
- Yu, R.; Yu, W.; Wang, X. KAN or MLP: A Fairer Comparison. arXiv preprint, arXiv:2407.16674 2024.
- Chen, Z.; Gundavarapu.; DI, W. Vision-KAN: Exploring the Possibility of KAN Replacing MLP in Vision Transformer; https://github.com/chenziwenhaoshuai/Vision-KAN.git 2024.
- Song, J.; Liu, Z.; Tegmark, M.; Gore, J. A Resource Model For Neural Scaling Law. arXiv preprint, arXiv:2402.05164 2024.
- Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems 1989, 2, 303–314. [Google Scholar] [CrossRef]
- Haykin, S. Neural networks: a comprehensive foundation; Prentice Hall PTR, 1994.
- Huang, Y. Improving Robustness of Deep Neural Networks with KAN-based Adversarial Training (KAT). IEEE Transactions on Artificial Intelligence 2024, 9, 130–145. [Google Scholar]
- Guo, H.; Li, F.; Li, J.; Liu, H. KAN vs. MLP for Offline Reinforcement Learning, 2024. arXiv:cs.LG/2409.09653.
- Jahin, M.A.; Masud, M.A.; Mridha, M.F.; Aung, Z.; Dey, N. KACQ-DCNN: Uncertainty-Aware Interpretable Kolmogorov-Arnold Classical-Quantum Dual-Channel Neural Network for Heart Disease Detection, 2024. arXiv:cs.LG/2410.07446].
- Jamali, A.; others. Advances in Kolmogorov-Arnold Networks for Data Fitting and PDE Solving. Journal of Computational Physics 2024, 423, 145–160. [Google Scholar]
- Yu, B.; others. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics 2018, 6, 1–12. [Google Scholar]
- Xu, L.; others. Time-Kolmogorov-Arnold Networks and Multi-Task Kolmogorov-Arnold Networks for Time Series Prediction. Journal of Time Series Analysis 2024, 45, 200–220. [Google Scholar]
- Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural networks 1989, 2, 359–366. [Google Scholar] [CrossRef]
- Agarwal, R.; Melnick, L.; Frosst, N.; Zhang, X.; Lengerich, B.; Caruana, R.; Hinton, G.E. Neural additive models: Interpretable machine learning with neural nets. Advances in neural information processing systems 2021, 34, 4699–4711. [Google Scholar]
- Lahini, Y.; Pugatch, R.; Pozzi, F.; Sorel, M.; Morandotti, R.; Davidson, N.; Silberberg, Y. Observation of a localization transition in quasiperiodic photonic lattices. Physical review letters 2009, 103, 013901. [Google Scholar] [CrossRef]
- Kiamari, M.; others. Graph Kolmogorov-Arnold Networks: A Novel Approach for Node Classification. IEEE Transactions on Neural Networks and Learning Systems 2024, 35, 1450–1465. [Google Scholar]
- Hestness, J.; Narang, S.; Ardalani, N.; Diamos, G.; Jun, H.; Kianinejad, H.; Patwary, M.M.A.; Yang, Y.; Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint, arXiv:1712.00409 2017.
- Xu, K.; Chen, L.; Wang, S. Kolmogorov-Arnold Networks for Time Series: Bridging Predictive Power and Interpretability. 2024; arXiv:cs.LG/2406.02496]. [Google Scholar]
- Yang, Z.; Zhang, J.; Luo, X.; Lu, Z.; Shen, L. Activation Space Selectable Kolmogorov-Arnold Networks. 2024; arXiv:cs.LG/2408.08338]. [Google Scholar]
- Aziznejad, S.; Unser, M. Deep spline networks with control of Lipschitz regularity. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3242–3246.
- Cheon, M. Kolmogorov-Arnold Network for Satellite Image Classification in Remote Sensing. arXiv preprint, arXiv:2406.00600 2024.
- Sitzmann, V.; Martel, J.; Bergman, A.; Lindell, D.; Wetzstein, G. Implicit neural representations with periodic activation functions. Advances in neural information processing systems 2020, 33, 7462–7473. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv preprint, arXiv:1606.08415 2016.
- Gao, Y.; Tan, V.Y.F. On the Convergence of (Stochastic) Gradient Descent for Kolmogorov–Arnold Networks. 2024; arXiv:cs.LG/2410.08041]. [Google Scholar]
- Hadash, G.; Kermany, E.; Carmeli, B.; Lavi, O.; Kour, G.; Jacovi, A. Estimate and Replace: A Novel Approach to Integrating Deep Neural Networks with Existing Applications. arXiv preprint, arXiv:1804.09028 2018.
- Kundu, A.; Sarkar, A.; Sadhu, A. Kanqas: Kolmogorov-Arnold Network for Quantum Architecture Search. arXiv preprint, arXiv:2406.17630 2024.
- Liu, Z.; Ma, P.; Wang, Y.; Matusik, W.; Tegmark, M. KAN 2.0: Kolmogorov-Arnold Networks Meet Science. arXiv preprint, arXiv:2408.10205 2024.
- Cunningham, H.; Ewart, A.; Riggs, L.; Huben, R.; Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint, arXiv:2309.08600 2023.
- Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 2022, 35, 17359–17372. [Google Scholar]
- Gukov, S.; Halverson, J.; Ruehle, F. Rigor with machine learning from field theory to the Poincaré conjecture. Nature Reviews Physics 2024. [Google Scholar] [CrossRef]
- Ismayilova, A.; Ismailov, V.E. On the Kolmogorov neural networks. Neural Networks 2024, 106333. [Google Scholar] [CrossRef]
- Yang, X.; Wang, X. Kolmogorov-Arnold Transformer. 2024; arXiv:cs.LG/2409.10594]. [Google Scholar]
- Wang, Y.; Xia, X.; Zhang, L.; Yao, H.; Chen, S.; You, J.; Zhou, Q.; Liu, X.J. One-dimensional quasiperiodic mosaic lattice with exact mobility edges. Physical Review Letters 2020, 125, 196604. [Google Scholar] [CrossRef] [PubMed]
- Lai, M.J.; Shen, Z. The kolmogorov superposition theorem can break the curse of dimensionality when approximating high dimensional functions. arXiv preprint, arXiv:2112.09963 2021.
- Cranmer, K.; others. Discovering Symbolic Models from Deep Learning with Inductive Biases. Advances in Neural Information Processing Systems 2020, 33, 17429–17442. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30.
- Fukushima, K. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics 1969, 5, 322–333. [Google Scholar] [CrossRef]
- Kohler, M.; Langer, S. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics 2021, 49, 2231–2249. [Google Scholar] [CrossRef]
- SS, S.; AR, K.; R, G.; KP, A. Chebyshev Polynomial-Based Kolmogorov-Arnold Networks: An Efficient Architecture for Nonlinear Function Approximation, 2024. arXiv:cs.LG/2405.07200].
- Sprecher, D.A.; Draghici, S. Space-filling curves and Kolmogorov superposition-based neural networks. Neural Networks 2002, 15, 57–67. [Google Scholar] [CrossRef]
- Kashefi, A. PointNet with KAN versus PointNet with MLP for 3D Classification and Segmentation of Point Sets, 2024. arXiv:cs.CV/2410.10084].
- Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 702–703.
- Telgarsky, M. Neural networks and rational functions. International Conference on Machine Learning. PMLR, 2017, pp. 3387–3393.
- He, J.; Li, L.; Xu, J.; Zheng, C. ReLU deep neural networks and linear finite elements. arXiv preprint, arXiv:1807.03973 2018.
- Gukov, S.; Halverson, J.; Manolescu, C.; Ruehle, F. Searching for ribbons with machine learning, 2023, [arXiv:math.GT/2304.09304].
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
- Girosi, F.; Poggio, T. Representation properties of networks: Kolmogorov’s theorem is irrelevant. Neural Computation 1989, 1, 465–469. [Google Scholar] [CrossRef]
- Mahara, A.; Rishe, N.D.; Deng, L. The Dawn of KAN in Image-to-Image (I2I) Translation: Integrating Kolmogorov-Arnold Networks with GANs for Unpaired I2I Translation, 2024. arXiv:cs.CV/2408.08216].
- Yu, R.C.; Wu, S.; Gui, J. Residual Kolmogorov-Arnold Network for Enhanced Deep Learning, 2024. arXiv:cs.CV/2410.05500].
- LeCun, Y.; Bengio, Y.; others. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 1995, 3361, 1995. [Google Scholar]
- Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. European conference on computer vision. Springer, 2022, pp. 280–296.
- Elhage, N.; Hume, T.; Olsson, C.; Nanda, N.; Henighan, T.; Johnston, S.; ElShowk, S.; Joseph, N.; DasSarma, N.; Mann, B.; Hernandez, D.; Askell, A.; Ndousse, K.; Jones, A.; Drain, D.; Chen, A.; Bai, Y.; Ganguli, D.; Lovitt, L.; Hatfield-Dodds, Z.; Kernion, J.; Conerly, T.; Kravec, S.; Fort, S.; Kadavath, S.; Jacobson, J.; Tran-Johnson, E.; Kaplan, J.; Clark, J.; Brown, T.; McCandlish, S.; Amodei, D.; Olah, C. Softmax Linear Units. Transformer Circuits Thread. 2022. https://transformer-circuits.pub/2022/solu/index.html.
- Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; Courville, A. On the spectral bias of neural networks. International conference on machine learning. PMLR, 2019, pp. 5301–5310.
- Yang, X.; Zhou, D.; Liu, S.; Ye, J.; Wang, X. Deep model reassembly. Advances in neural information processing systems 2022, 35, 25739–25753. [Google Scholar]
- Schmidt-Hieber, J. The Kolmogorov–Arnold representation theorem revisited. Neural networks 2021, 137, 119–126. [Google Scholar] [CrossRef]
- Kour, G.; Saabne, R. Real-time segmentation of on-line handwritten arabic script. Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 417–422.
- Liu, J.; others. Kolmogorov-Arnold Networks for Symbolic Regression and Time Series Prediction. Journal of Machine Learning Research 2024, 25, 95–110. [Google Scholar]
- Kovachki, N.; Li, Z.; Liu, B.; Azizzadenesheli, K.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Neural operator: Learning maps between function spaces with applications to pdes. Journal of Machine Learning Research 2023, 24, 1–97. [Google Scholar]
- Poeta, E.; Giobergia, F.; Pastor, E.; Cerquitelli, T.; Baralis, E. A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data, 2024. arXiv:cs.LG/2406.14529].
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; others. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
- Aghaomidi, P.; Wang, G. ECG-SleepNet: Deep Learning-Based Comprehensive Sleep Stage Classification Using ECG Signals, 2024. arXiv:cs.AI/2412.01929].
- Hecht-Nielsen, R. Kolmogorov’s mapping neural network existence theorem. Proceedings of the international conference on Neural Networks. IEEE press New York, NY, USA, 1987, Vol. 3, pp. 11–14.
- Meunier, D.; Lambiotte, R.; Bullmore, E.T. Modular and hierarchically modular organization of brain networks. Frontiers in neuroscience 2010, 4, 7572. [Google Scholar] [CrossRef]
- Inzirillo, H.; Genet, R. A Gated Residual Kolmogorov-Arnold Networks for Mixtures of Experts, 2024. arXiv:cs.LG/2409.15161].
- Qiu, R.; Miao, Y.; Wang, S.; Yu, L.; Zhu, Y.; Gao, X.S. PowerMLP: An Efficient Version of KAN, 2024. arXiv:cs.LG/2412.13571].
- Zniyed, Y.; Nguyen, T.P.; others. Efficient tensor decomposition-based filter pruning. Neural Networks 2024, 178, 106393. [Google Scholar]
- Walsh, J.L. Interpolation and approximation by rational functions in the complex domain; Vol. 20, American Mathematical Soc., 1935.
- Courant, R.; Friedrichs, K.; Lewy, H. On the partial difference equations of mathematical physics. Mathematische Annalen 1928, 100, 32–74. [Google Scholar] [CrossRef]
- Jin, J.; Li, X.; Huang, H.; Liu, L.; Sun, Y. PEP-GS: Perceptually-Enhanced Precise Structured 3D Gaussians for View-Adaptive Rendering, 2024. arXiv:cs.CV/2411.05731].
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6023–6032.
- Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; Anandkumar, A. Fourier neural operator for parametric partial differential equations. arXiv preprint, arXiv:2010.08895 2020.
- Poggio, T.; Banburski, A.; Liao, Q. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences 2020, 117, 30039–30045. [Google Scholar] [CrossRef]
- Shukla, K.; Toscano, J.D.; Wang, Z.; Zou, Z.; Karniadakis, G.E. A comprehensive and FAIR comparison between MLP and KAN representations for differential equations and operator networks, 2024. arXiv:cs.LG/2406.02917].
- Wang, Y.; Siegel, J.W.; Liu, Z.; Hou, T.Y. On the expressiveness and spectral bias of KANs. arXiv preprint, arXiv:2410.01803 2024.
- Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function 2020.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; others. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 2019, 32. [Google Scholar]
- Elhage, N.; Hume, T.; Olsson, C.; Schiefer, N.; Henighan, T.; Kravec, S.; Hatfield-Dodds, Z.; Lasenby, R.; Drain, D.; Chen, C.; others. Toy models of superposition. arXiv preprint, arXiv:2209.10652. 2022.
- Ronen, B.; Jacobs, D.; Kasten, Y.; Kritchman, S. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems 2019, 32. [Google Scholar]
- Xu, H.; Sin, F.; Zhu, Y.; Barbič, J. Nonlinear material design using principal stretches. ACM Transactions on Graphics (TOG) 2015, 34, 1–11. [Google Scholar] [CrossRef]
- He, Y. Machine Learning in Pure Mathematics and Theoretical Physics; G - Reference,Information and Interdisciplinary Subjects Series, World Scientific, 2023.
- Polo-Molina, A.; Alfaya, D.; Portela, J. MonoKAN: Certified Monotonic Kolmogorov-Arnold Network, 2024. arXiv:cs.LG/2409.11078].
- Braun, J.; Griebel, M. On a constructive proof of Kolmogorov’s superposition theorem. Constructive Approximation 2009, 30, 653–675. [Google Scholar] [CrossRef]
- Zhang, S.; Zhang, P.; Hou, T.Y. Multiscale invertible generative networks for high-dimensional Bayesian inference. International Conference on Machine Learning. PMLR, 2021, pp. 12632–12641.
- Li, Z. Kolmogorov-Arnold Networks are Radial Basis Function Networks. ArXiv 2024, abs/2405.06721.
- Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv preprint, arXiv:2404.19756 2024.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
- Jacob, B.; Howard, A.A.; Stinis, P. SPIKANs: Separable Physics-Informed Kolmogorov-Arnold Networks, 2024. arXiv:cs.LG/2411.06286].
- Lu, L.; Jin, P.; Pang, G.; Zhang, Z.; Karniadakis, G.E. Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators. Nature machine intelligence 2021, 3, 218–229. [Google Scholar] [CrossRef]
- Wang, S.; Li, B.; Chen, Y.; Perdikaris, P. PirateNets: Physics-informed Deep Learning with Residual Adaptive Networks. arXiv preprint, arXiv:2402.00326 2024.
- Sun, Y.; Zhu, H.; Qin, C.; Zhuang, F.; He, Q.; Xiong, H. Discerning decision-making process of deep neural networks with hierarchical voting transformation. Advances in Neural Information Processing Systems 2021, 34, 17221–17234. [Google Scholar]
- Michaud, E.J.; Liu, Z.; Tegmark, M. Precision machine learning. Entropy 2023, 25, 175. [Google Scholar] [CrossRef]
- Boullé, N.; Nakatsukasa, Y.; Townsend, A. Rational neural networks. Advances in neural information processing systems 2020, 33, 14243–14253. [Google Scholar]
- Lin, H.W.; Tegmark, M.; Rolnick, D. Why does deep and cheap learning work so well? Journal of Statistical Physics 2017, 168, 1223–1247. [Google Scholar] [CrossRef]
- Wang, S.; Sankaran, S.; Perdikaris, P. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint, arXiv:2203.07404 2022.
- Kolmogorov, A.N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk. Russian Academy of Sciences, 1957, Vol. 114, pp. 953–956.
- Wang, H.; others. SpectralKAN: Spatial-Spectral Kolmogorov-Arnold Networks for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing 2024, 62, 500–515. [Google Scholar]
- Goyal, M.; Goyal, R.; Lall, B. Learning activation functions: A new paradigm for understanding neural networks. arXiv preprint, arXiv:1906.09529 2019.
- Udrescu, S.M.; Tan, A.; Feng, J.; Neto, O.; Wu, T.; Tegmark, M. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. Advances in Neural Information Processing Systems 2020, 33, 4860–4871. [Google Scholar]
- Galitsky, B.A. Kolmogorov-Arnold Network for Word-Level Explainable Meaning Representation. Preprints 2024. Retrieved from https://www.preprints.org/manuscript/202405.1981.
- Braun, J.; Griebel, M. On a constructive proof of Kolmogorov’s superposition theorem. Constructive approximation 2009, 30, 653–675. [Google Scholar] [CrossRef]
- Kauffman, L.H.; Russkikh, N.E.; Taimanov, I.A. Rectangular knot diagrams classification with deep learning, 2020. arXiv:math.GT/2011.03498].
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv preprint, arXiv:1710.05941 2017.
- Deng, X.; He, X.; Bao, J.; Zhou, Y.; Cai, S.; Cai, C.; Chen, Z. MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement, 2025. arXiv:cs.CV/2411.18309].
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA; Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H.M.; Fergus, R.; Vishwanathan, S.V.N.; Garnett, R., Eds., 2017, pp. 5998–6008.
- Gordon, M.A.; Duh, K.; Kaplan, J. Data and Parameter Scaling Laws for Neural Machine Translation. ACL Rolling Review - May 2021, 2021.
- Bingham, G.; Miikkulainen, R. Discovering parametric activation functions. Neural Networks 2022, 148, 48–65. [Google Scholar] [CrossRef]
- Lagendijk, A.; Tiggelen, B.v.; Wiersma, D.S. Fifty years of Anderson localization. Physics today 2009, 62, 24–29. [Google Scholar] [CrossRef]
- Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks 2017, 94, 103–114. [Google Scholar] [CrossRef] [PubMed]
- Gukov, S.; Halverson, J.; Ruehle, F.; Sułkowski, P. Learning to Unknot. Mach. Learn. Sci. Tech. 2021, 2, 025035. arXiv:math.GT/2010.16263. [CrossRef]
- Zhang, S.; Shen, Z.; Yang, H. Neural network architecture beyond width and depth. Advances in Neural Information Processing Systems 2022, 35, 5669–5681. [Google Scholar]
- Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. Proceedings of the European conference on computer vision (ECCV), 2018, pp. 418–434.
- Yu, W.; Si, C.; Zhou, P.; Luo, M.; Zhou, Y.; Feng, J.; Yan, S.; Wang, X. Metaformer baselines for vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023. [Google Scholar] [CrossRef]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J. ; others. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint, arXiv:1906.07155 2019.
- Unser, M.; Aldroubi, A.; Eden, M. B-spline signal processing. I. Theory. IEEE transactions on signal processing 1993, 41, 821–833. [Google Scholar] [CrossRef]
- Wu, Y.; He, K. Group normalization. Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
- Nanda, N.; Chan, L.; Lieberum, T.; Smith, J.; Steinhardt, J. Progress measures for grokking via mechanistic interpretability. The Eleventh International Conference on Learning Representations, 2023.
- SpringerLink. On functions of three variables, 2023. Retrieved from https://link.springer.com/article/10.1007/BF01213206.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv preprint, arXiv:2001.08361 2020.
- Horner, W. A New Method of Solving Numerical Equations of all Orders, by Continuous Approximation. Abstracts of the Papers Printed in the Philosophical Transactions of the Royal Society of London. JSTOR, 1815, Vol. 2, pp. 117–117.
- Siegel, J.W. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research 2023, 24, 1–52. [Google Scholar]
- Ruehle, F. Data science applications to string theory. Phys. Rept. 2020, 839, 1–117. [Google Scholar] [CrossRef]
- Lin, J.N.; Unbehauen, R. On the realization of a Kolmogorov network. Neural Computation 1993, 5, 18–20. [Google Scholar] [CrossRef]
- Lu, A.; Feng, T.; Yuan, H.; Song, X.; Sun, Y. Revisiting Neural Networks for Continual Learning: An Architectural Perspective, 2024. arXiv:cs.LG/2404.14829].
- Zniyed, Y.; Nguyen, T.P.; others. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024. [Google Scholar]
- Zhou, X.C.; Wang, Y.; Poon, T.F.J.; Zhou, Q.; Liu, X.J. Exact new mobility edges between critical and localized states. Physical Review Letters 2023, 131, 176401. [Google Scholar] [CrossRef] [PubMed]
- Kolb, B.; Whishaw, I.Q. Brain plasticity and behavior. Annual review of psychology 1998, 49, 43–64. [Google Scholar] [CrossRef] [PubMed]
- Dubcáková, R. Eureqa: software review. Genetic Programming and Evolvable Machines 2011, 12, 173–178. [Google Scholar] [CrossRef]
- Cheon, M. Kolmogorov-Arnold Network for Satellite Image Classification in Remote Sensing, 2024. arXiv:cs.CV/2406.00600].
- Ruijters, D.; ter Haar Romeny, B.M.; Suetens, P. Efficient GPU-based texture interpolation using uniform B-splines. Journal of Graphics Tools 2008, 13, 61–69. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- Li, X.; Ganeshan, S.; Pixley, J.; Sarma, S.D. Many-body localization and quantum nonergodicity in a model with a single-particle mobility edge. Physical review letters 2015, 115, 186601. [Google Scholar] [CrossRef]
- Wang, Y.; Sun, J.; Bai, J.; Anitescu, C.; Eshaghi, M.S.; Zhuang, X.; Rabczuk, T.; Liu, Y. Kolmogorov Arnold Informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on Kolmogorov Arnold Networks, 2024. arXiv:cs.LG/2406.11045].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
