LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Kernel Tuner
Triton implementation of FlashAttention-2 with support for custom masks.
Implementation of the Conjugate Gradients method in C and NVIDIA CUDA.
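The repository above implements Conjugate Gradients in C/CUDA; as a minimal CPU sketch of the same algorithm, here is a pure-Python version (the helper functions and the example system are illustrative, not taken from the repository):

```python
# Conjugate Gradients for solving A x = b, where A is symmetric positive definite.
# Pure-Python CPU sketch; a CUDA version would parallelize matvec and dot.

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradients(A, b, tol=1e-12, max_iter=1000):
    n = len(b)
    x = [0.0] * n                      # initial guess x0 = 0
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # residual r = b - A x
    p = list(r)                        # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)    # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:               # converged: residual is tiny
            break
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Example: solve a small 2x2 SPD system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradients(A, b)          # exact solution is [1/11, 7/11]
```

On a GPU the two hot spots are the matrix-vector product and the reductions (`dot`), which is why the method maps well to CUDA.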
Faster PyTorch bitsandbytes 4-bit fp4 nn.Linear ops
This repository contains examples of CUDA usage in Cython code.
This is a cross-chip platform collection of operators and a unified neural network library.
Code repository for ICLR 2025 paper "LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid"
PyTorch implementation of a message passing neural network with RNN sub-units
Object Tracking using GPU acceleration.
This project integrates a custom CUDA-based matrix multiplication kernel into a PyTorch deep learning model, leveraging GPU acceleration for matrix operations. The goal is to compare the performance of this custom kernel with PyTorch's built-in matrix multiplication and demonstrate how custom CUDA kernels can optimize compute-intensive operations.
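Custom CUDA matmul kernels like the one described above usually gain their speed from tiling: each thread block computes one output tile while staging sub-tiles of the inputs in shared memory. A pure-Python CPU sketch of that blocked loop structure (illustrative only; the project itself compiles a real CUDA kernel and benchmarks it against PyTorch's built-in matmul):

```python
# Naive triple-loop matmul: the CPU analogue of one thread per output element.
def matmul_naive(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            C[i][j] = sum(A[i][t] * B[t][j] for t in range(k))
    return C

# Blocked (tiled) matmul: the output is computed tile-by-tile, mirroring how
# a CUDA thread block stages `tile x tile` chunks of A and B in shared memory
# before accumulating partial products.
def matmul_tiled(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):           # tile row of C
        for jj in range(0, m, tile):       # tile column of C
            for tt in range(0, k, tile):   # tile along the shared dimension
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        for t in range(tt, min(tt + tile, k)):
                            C[i][j] += A[i][t] * B[t][j]
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
assert matmul_naive(A, B) == matmul_tiled(A, B)
```

On a GPU, the tiled variant wins because each staged tile is reused by every thread in the block, cutting global-memory traffic; in pure Python both versions are equally slow, so this is only a sketch of the loop structure.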
Spiral's Machine Learning Library
KernelHeim – development ground of custom Triton and CUDA kernel functions designed to optimize and accelerate machine learning workloads on NVIDIA GPUs. Inspired by the mythical stronghold of the gods, KernelHeim is a forge where high-performance kernels are crafted to unlock the full potential of the hardware.
SParry is a Python shortest-path tool that uses CUDA to speed up several shortest-path algorithms.
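For reference, the CPU baseline such GPU tools accelerate is a standard shortest-path algorithm like Dijkstra's; a minimal pure-Python sketch (the graph format here is illustrative and not SParry's actual API):

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths over an adjacency list:
    graph maps node -> list of (neighbor, edge_weight)."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry; a shorter path was already found
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0)],
    "c": [],
}
dist = dijkstra(g, "a")  # a -> c is cheaper via b (1.0 + 2.0) than direct (4.0)
```

Dijkstra's priority queue is inherently sequential, which is why GPU implementations typically switch to more parallel-friendly formulations such as Bellman-Ford or delta-stepping.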
GPU Programming with CUDA: a repository of the materials I used while teaching myself CUDA and building a strong foundation in the basics.
CNNs for spectrogram-based music recommendation (Undergraduate dissertation)
A Python script that uses the CuPy library to perform optimized GPU matrix multiplication. It includes a custom CUDA kernel tuned for performance and energy consumption, using half-precision floating-point numbers (float16) to improve throughput and warp utilization.
High-performance 2D Quantum Dot (QD) Simulator implemented in C++ and Python
A collection of ultra-simple yet high-performance CUDA kernels.