Daily Reports
Zero-to-CAD 1M: A Million Synthetic, Executable CAD Programs for AI Generation
ADSKAILab/Zero-To-CAD-1m
Zero-to-CAD 1M, released by Autodesk Research (ADSKAILab), is a synthetic dataset of 999,633 executable CadQuery Python scripts that construct parametric 3‑D models. Each entry includes the source cod...
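Since the dataset's selling point is that every script is executable, a natural first check is to run each entry in an isolated namespace and confirm it completes. The sketch below assumes each record exposes its script source in a `code`-like field (not confirmed here) and uses a trivial stand-in script so it runs without CadQuery installed; a real entry would begin with `import cadquery as cq` and build a parametric solid.

```python
# Hypothetical validation sketch for Zero-to-CAD 1M entries. The dataset
# advertises its scripts as executable, so exec'ing one should complete
# without raising. A stand-in script is used so this runs without CadQuery.
def run_entry(source: str) -> dict:
    """Execute one dataset script in an isolated namespace and return it."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace

# Stand-in for a real entry; an actual script would call CadQuery, e.g.
# `result = cq.Workplane("XY").box(length, width, height)`.
stand_in = "length, width = 40, 20\nresult = ('box', length * width)"
ns = run_entry(stand_in)
print(ns["result"])  # → ('box', 800)
```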
Claude Opus 4.6/4.7 Reasoning Dataset: 8.7K Synthetic CoT Examples Across 28 Domains
angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k
The *Claude Opus 4.6/4.7 Reasoning Dataset* is a synthetic instruction‑tuning collection created entirely by Claude Opus models (versions 4.6 and 4.7). It contains 8,706 OpenAI‑style chat examples in ...
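For OpenAI-style chat data like this, each example is typically one JSONL line holding a `messages` list of role/content turns. The snippet below is a minimal sketch of that shape with illustrative content; the exact field names and system prompts in this dataset are assumptions, not confirmed.

```python
import json

# Assumed record shape for an OpenAI-style chat example; field names
# follow the common {"messages": [{"role": ..., "content": ...}]} layout.
record = {
    "messages": [
        {"role": "system", "content": "You are a careful reasoner."},
        {"role": "user", "content": "What is 17 * 6?"},
        {"role": "assistant", "content": "17 * 6 = 102."},
    ]
}

def last_assistant_turn(rec: dict) -> str:
    """Return the final assistant message, the usual SFT training target."""
    turns = [m["content"] for m in rec["messages"] if m["role"] == "assistant"]
    return turns[-1]

line = json.dumps(record)  # one JSONL line per example
print(last_assistant_turn(json.loads(line)))  # → 17 * 6 = 102.
```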
Qwen3.6‑27B Heretic‑Uncensored Finetune (NEO‑CODE‑Di‑IMatrix‑MAX) – GGUF Quantized Model Report
DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF
The Qwen3.6‑27B Heretic‑Uncensored model is a post‑trained, 27‑billion‑parameter causal language model with a vision encoder that has been stripped of safety‑filtering (“heretic”) and further finetune...
Gemma‑4 31B‑IT Assistant: Draft Model for Fast Multimodal Generation
google/gemma-4-31B-it-assistant
The **google/gemma-4-31B-it-assistant** model is the Multi‑Token Prediction (MTP) drafter for the 31‑billion‑parameter Gemma‑4 instruction‑tuned model. It extends the base Gemma‑4 model with a smaller...
DeepSeek‑V4‑Flash: 284B MoE Model with 1‑Million‑Token Context
deepseek-ai/DeepSeek-V4-Flash
DeepSeek‑V4‑Flash, released by DeepSeek‑AI, is a Mixture‑of‑Experts (MoE) language model with 284 B total parameters but only 13 B activated during inference. Built on the Transformers library and dis...
TaskTrove: 750K Agentic Tasks for RL & SFT Evaluation
open-thoughts/TaskTrove
TaskTrove is an open‑source collection of more than 750,000 unique agentic tasks gathered from over 100 public sources. Released by the OpenThoughts‑Agent team, the dataset aggregates popular reinforc...
talkie-1930-13b-it: Vintage 13B Model Tuned for Pre‑1931 English Instructions
talkie-lm/talkie-1930-13b-it
talkie-1930-13b-it is a 13‑billion‑parameter "vintage" language model that builds on the talkie-1930-13b-base checkpoint. The base model was trained on 260 B tokens drawn exclusively from English‑lang...
MiMo-V2.5: Xiaomi’s Omnimodal 1M‑Token Agent Model Takes Center Stage
XiaomiMiMo/MiMo-V2.5
MiMo-V2.5, released by Xiaomi’s MiMo team, is a native omnimodal model that unifies text, image, video, and audio understanding within a single architecture. Built on the MiMo-V2‑Flash backbone, it em...
Uncensored Qwen3.6‑27B Aggressive: Multimodal GGUF Model for Vision‑Language Tasks
HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
The **Qwen3.6‑27B‑Uncensored‑HauhauCS‑Aggressive** model is a 27‑billion‑parameter, multilingual (English, Chinese, and other languages) vision‑language model built on the original Qwen/Qwen3.6‑27B ar...
TAAC2026 Demo Recommendation Dataset (1K Samples) – A Quick Overview
TAAC2026/data_sample_1000
The TAAC2026 data_sample_1000 dataset is a small, 1,000‑row sample of user‑item interaction records released for the TAAC2026 competition. Stored as a flat‑column Parquet file (~39 MB) it contains 120...
AtomBlock-WebUI: A Synthetic UI Detection Dataset Tailored for YOLO
ZhihaoNan/AtomBlock-WebUI
AtomBlock-WebUI is a synthetic dataset of roughly 9,700 full‑page web screenshots, each annotated with YOLO‑format bounding boxes for 14 UI element categories such as buttons, links, inputs, and struc...
GSM8K: 8K Grade‑School Math Problems Powering LLM Reasoning
openai/gsm8k
The GSM8K dataset, released by OpenAI, contains 8,473 English‑language grade‑school math word problems split into a training set of 7,473 examples and a test set of 1,319 examples. Each entry provides...
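GSM8K solutions end with a `#### <number>` marker giving the final answer, which is what makes exact-match grading straightforward. A minimal extractor (the worked example below is illustrative):

```python
import re

# GSM8K answer fields terminate with "#### <number>"; pull out that final
# numeric answer and strip thousands separators for exact-match scoring.
def extract_answer(solution: str) -> str:
    match = re.search(r"####\s*(-?[\d,\.]+)\s*$", solution.strip())
    if match is None:
        raise ValueError("no #### answer marker found")
    return match.group(1).replace(",", "")

example = (
    "Natalia sold 48 clips in April and half as many in May.\n"
    "48 / 2 = 24 clips in May. 48 + 24 = 72.\n"
    "#### 72"
)
print(extract_answer(example))  # → 72
```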
DeepSeek-V4-Flash-Base: FP8‑Optimized Model in Safetensors Format
deepseek-ai/DeepSeek-V4-Flash-Base
The DeepSeek‑V4‑Flash‑Base model (deepseek-ai/DeepSeek-V4-Flash-Base) is a recently uploaded member of the DeepSeek V4 series. Its metadata highlights a few key technical attributes: it is...
MathNet v0: Multilingual Olympiad Math Reasoning & Retrieval Dataset Gains Traction
ShadenA/MathNet
MathNet v0 is a large‑scale, multimodal dataset of Olympiad‑level mathematics problems released by ShadenA (MIT) and featured in ICLR 2026. It aggregates 30,676 expert‑authored problems from 47 countr...
Qwen3.6‑27B‑FP8 – Fast, Vision‑Enabled, Open‑Weight LLM
Qwen/Qwen3.6-27B-FP8
Qwen3.6‑27B‑FP8 is a fine‑grained FP8‑quantized version of the 27‑billion‑parameter Qwen3.6 model, compatible with Transformers, vLLM, SGLang, KTransformers and Azure endpoints. It combines a causal l...
Fenrir v2.1: 100K Defensive Cybersecurity Chat Triples for Safe LLM Fine‑Tuning
AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1
The Fenrir v2.1 dataset, authored by Alican Kiraz, offers 99,870 high‑quality *system / user / assistant* chat triples designed for instruction‑tuning of defensive‑security language models. All record...
Claude Opus 4.6 Reasoning Traces Dataset Fuels Tiny Model Fine‑Tuning
Roman1111111/claude-opus-4.6-10000x
The **Roman1111111/claude-opus-4.6-10000x** dataset is a high‑fidelity reasoning collection generated by Anthropic's Claude Opus 4.6. Each entry is a JSONL record that pairs a challenging math or logi...
VoxCPM2: 2B‑Parameter Multilingual Diffusion TTS with Voice Design & Cloning
openbmb/VoxCPM2
VoxCPM2, released by the OpenBMB VoxCPM team, is a tokenizer‑free diffusion autoregressive text‑to‑speech model that packs 2 billion parameters and supports 30 languages, including many Chinese dialec...
ParseBench: Enterprise Document Parsing Benchmark Takes Center Stage
llamaindex/ParseBench
ParseBench is a new, officially‑released benchmark for evaluating document‑parsing systems on real‑world enterprise PDFs. Curated by the LlamaIndex team, the dataset contains roughly 2,000 human‑verif...
Qwen3.6-35B-A3B Model – Highlights, Benchmarks, and Opportunities
Qwen/Qwen3.6-35B-A3B
Qwen3.6-35B-A3B is a 35‑billion‑parameter causal language model with a vision encoder, released as the first open‑weight variant of the Qwen3.6 series. It features a 262k native context window (extend...
HY‑World 2.0: Open‑Source Multi‑Modal 3D World Generation & Reconstruction
tencent/HY-World-2.0
HY‑World 2.0 is a pioneering multi‑modal world model released by Tencent that bridges text, images, multi‑view photos, and video to realistic 3D scenes. It is advertised as the first open‑source, sta...
MOSS‑TTS‑Nano: Tiny Multilingual Real‑Time TTS for CPU‑Only Apps
OpenMOSS-Team/MOSS-TTS-Nano-100M
MOSS‑TTS‑Nano is an open‑source multilingual text‑to‑speech model released by the OpenMOSS team and MOSI.AI. With only 0.1 B parameters, it targets real‑time speech generation on modest hardware – the...
One Million Reasoning Traces: KIMI‑K2.5‑1000000x Dataset
ianncity/KIMI-K2.5-1000000x
The **KIMI‑K2.5‑1000000x** dataset, authored by *ianncity* and released in March 2026, contains one million distilled reasoning traces generated from the KIMI‑K2.5 model on high‑level reasoning tasks....
⚡ Gemma‑4‑31B‑IT NVFP4 Turbo: 68% Smaller, 2.5× Faster Text Generation
LilaRest/gemma-4-31B-it-NVFP4-turbo
LilaRest’s *Gemma 4 31B IT NVFP4 Turbo* is a repackaged, quantized version of Google DeepMind’s Gemma‑4 31B‑IT model. Built on the NVIDIA NVFP4 checkpoint, it quantizes self‑attention weights to FP4 (...
OmniVoice: 600‑Language Zero‑Shot TTS Takes Center Stage
k2-fsa/OmniVoice
OmniVoice is a massively multilingual zero‑shot text‑to‑speech (TTS) model that supports over 600 languages, making it the broadest‑coverage TTS system currently available. Built on a diffusion langua...
GLM-5.1: Multilingual Agentic Coding Model Takes the Lead
zai-org/GLM-5.1
GLM-5.1, released by the ZAI organization, is the latest flagship large language model designed for agentic engineering. It is a multilingual text‑generation model (English and Chinese) built on the T...
Hermes Agent Reasoning Traces: Real Tool-Calling Trajectories for AI Agent Training
lambda/hermes-agent-reasoning-traces
The **Hermes Agent Reasoning Traces** dataset, released by lambda, provides multi‑turn tool‑calling trajectories captured from two powerful LLMs: Moonshot AI's Kimi‑K2.5 and ZhipuAI's GLM‑5.1-FP8. Eac...
Wikimedia Wikipedia: A Multilingual Text Dataset Spanning 600+ Language Configurations
wikimedia/wikipedia
This dataset comprises a vast collection of text data spanning over 600 language configurations, each with its own training split. It includes a wide variety of linguistic resources, ranging from well...
Vietnam Real Estate Listings 2025: 1M Records for Price Prediction & Market Insight
tinixai/vietnam-real-estates
The **Tinix Vietnam Real Estate Listings 2025** dataset, curated by TiniX AI, provides a comprehensive snapshot of the Vietnamese property market with exactly **1,000,000** listings collected between ...
Redacted Coding Agent Session Traces from pi-mono – A New Dataset for Code Generation Research
badlogicgames/pi-mono
The *badlogicgames/pi-mono* dataset offers a collection of redacted coding‑agent session traces harvested from work on the open‑source *pi-mono* repository (https://github.com/badlogic/pi-mono.git). E...
NVIDIA‑Optimized Gemma‑4 31B IT NVFP4: Fast Multimodal Text Generation
nvidia/Gemma-4-31B-IT-NVFP4
The **Gemma‑4 31B IT NVFP4** model is a quantized version of Google DeepMind's open‑source Gemma‑4 31B IT multimodal transformer. Built on 30.7 B parameters with a 256K‑token context window, it accept...
Gemma 4 E2B‑IT: On‑Device Multimodal Reasoning Model Hits the Spotlight
google/gemma-4-E2B-it
Google DeepMind’s Gemma 4 E2B‑IT is the newest open‑weight multimodal model on Hugging Face, offering 2.3 B effective parameters (5.1 B total with embeddings) and a 128K token context window. Built on...
FineWeb: 18.5 T Tokens of High‑Quality English Web Text Now Open
HuggingFaceFW/fineweb
The FineWeb dataset, released by HuggingFaceFW, provides over 18.5 trillion tokens of cleaned and deduplicated English web data sourced from CommonCrawl. Licensed under ODC‑By 1.0, it targets the text...
UniSAFE: Multimodal Image‑Text Safety Dataset (Trending)
segyulee/UniSAFE
UniSAFE, authored by segyulee, is a multimodal dataset that pairs images with textual instructions and metadata describing unsafe triggers, target outcomes, and scenario types. It is stored in optimiz...
Gemma‑4 26B A4B: Open‑Weight Multimodal MoE Model for Image‑Text Reasoning
google/gemma-4-26B-A4B-it
Google DeepMind’s Gemma‑4 family expands with the 26B A4B mixture‑of‑experts (MoE) model, released under an Apache‑2.0 license. Identified on Hugging Face as `google/gemma-4-26B-A4B-it` and tagged for...
Gemma‑4 31B‑IT: A Multimodal Reasoning Powerhouse for Images, Video & Text
google/gemma-4-31B-it
The **google/gemma-4-31B-it** model is the instruction‑tuned, 31‑billion‑parameter dense variant of Google DeepMind's Gemma 4 family. Hosted on Hugging Face, it belongs to the *image‑text‑to‑text* pip...
TRIBE v2: Multimodal Brain‑Encoding Model for fMRI Prediction
facebook/tribev2
TRIBE v2 is a foundation multimodal model released by Facebook Research that predicts functional MRI (fMRI) brain responses to naturalistic stimuli across vision, audition, and language. The model int...
KIMI-K2.5-450000x: 450K High‑Quality Reasoning Traces for LLM Tuning
ianncity/KIMI-K2.5-450000x
The **KIMI-K2.5-450000x** dataset, authored by *ianncity*, contains 450,000 distilled reasoning traces generated from the KIMI‑K2.5 model under a "high" reasoning setting. With a total token count of ...
Claude Opus 4.6 Extended Reasoning Dataset: Traces for Better LLM Reasoning
TeichAI/Claude-Opus-4.6-Reasoning-887x
The **Claude Opus 4.6 Extended Reasoning** dataset, created by TeichAI, is a small (<1 K records) JSON‑formatted collection of reasoning traces generated with Anthropic's Claude Opus 4.6. It aggregat...
Efficient Multilingual Reasoning with Qwen3.5‑9B‑Claude‑Opus Distilled v2 (GGUF)
Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF
The **Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2** model is a second‑generation fine‑tune of the Qwen3.5‑9B base model, distilled from over 14,000 Claude 4.6 Opus‑style reasoning samples. Built...
Claude‑Sonnet 4.6 Reasoning Dataset: 799 Deep Thought Conversations
TeichAI/Claude-Sonnet-4.6-Reasoning-799x
The **Claude‑Sonnet‑4.6‑Reasoning‑799x** dataset, authored by TeichAI, contains 799 single‑turn user→assistant exchanges that focus exclusively on chain‑of‑thought reasoning. Each response averages ar...
Michael Hafftka Catalog Raisonné: 3.8K Paintings with Rich Metadata
Hafftka/michael-hafftka-catalog-raisonne
The **Michael Hafftka – Catalog Raisonné** dataset is a curated collection of roughly 3,800 digitized paintings by the American expressionist Michael Hafftka, spanning the period from the 1970s throug...
AutoMathText-V2 Dataset Overview Report
OpenSQZ/AutoMathText-V2
AutoMathText-V2 is a curated collection of 52 premium data sources spanning web content, mathematics, code, reasoning, formal proofs, and bilingual translation. It aggregates over 1.5 trillion tokens,...
Hacker News Dataset Report
open-index/hacker-news
The Hacker News Complete Archive mirrors every item posted on news.ycombinator.com from its inception in October 2006 through the present day, totaling over 47 million records. The data is stored in m...
OmniAction: A Massive Omni‑modal Dataset for Proactive Robot Manipulation
OpenMOSS-Team/OmniAction
The OpenMOSS-Team has released **OmniAction**, a large‑scale multimodal dataset designed for contextual instruction following in robotic manipulation. Hosted on HuggingFace, the dataset contains 141,1...
OmniCoder-9B: A 9B Coding Agent Fine‑Tuned on 425K Agentic Trajectories
Tesslate/OmniCoder-9B
OmniCoder-9B is a 9‑billion‑parameter coding agent released by Tesslate and built on top of Qwen3.5‑9B’s hybrid Gated‑Delta/standard‑attention architecture. It has been fine‑tuned with LoRA (r=64, alp...
Foundation-1: Structured Text‑to‑Sample Music Generator Takes Center Stage
RoyalCities/Foundation-1
Foundation-1, released by RoyalCities, is a next‑generation text‑to‑sample model fine‑tuned from Stability AI’s stable‑audio‑open‑1.0. Designed for modern music production, it interprets layered promp...
Screen‑Recording Dataset Powers Next‑Gen Desktop AI Agents
markov-ai/computer-use-large
The **Computer Use Large** dataset, released by *markov-ai*, contains 48,478 trimmed screen‑recording videos totalling roughly 12,300 hours of professional software usage. All videos are audio‑free an...
Uncensored Multimodal Power: Qwen3.5-35B-A3B Aggressive Variant
HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive
The **Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive** model is an uncensored, aggressive fork of the original Qwen3.5‑35B‑A3B released by the community contributor HauhauCS. With over 210 k downloads...
Fish Audio S2 Pro – Multilingual TTS with Fine‑Grained Inline Control
fishaudio/s2-pro
Fish Audio S2 Pro is a state‑of‑the‑art text‑to‑speech (TTS) model released by the Fish Audio research team. It supports more than 80 languages, including tier‑1 coverage for English, Chinese, and Jap...
olmOCR-bench: The New Standard for PDF‑to‑Markdown OCR Evaluation
allenai/olmOCR-bench
olmOCR-bench, released by AllenAI, is a benchmark dataset comprising 1,403 PDF files and 7,010 unit test cases that capture the properties a high‑quality OCR system should preserve when converting PDF...
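The "unit test" framing means each PDF comes with small, mechanically checkable properties its Markdown conversion must satisfy. The sketch below shows one plausible shape of such checks, text presence and reading order; the function names and sample Markdown are illustrative, not the benchmark's actual API.

```python
# Illustrative sketch of olmOCR-bench-style unit checks on OCR output:
# (1) a ground-truth snippet must survive PDF-to-Markdown conversion, and
# (2) two snippets must appear in the correct reading order.
def text_present(md: str, snippet: str) -> bool:
    return snippet in md

def in_reading_order(md: str, first: str, second: str) -> bool:
    i, j = md.find(first), md.find(second)
    return 0 <= i < j

markdown = "# Results\nTable 1 shows accuracy.\nAccuracy reached 91.2%."
print(text_present(markdown, "91.2%"))                 # → True
print(in_reading_order(markdown, "Table 1", "91.2%"))  # → True
```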
BONES-SEED: Massive Multimodal Motion Dataset for Humanoid Robotics
bones-studio/seed
BONES-SEED (Skeletal Everyday Embodiment Dataset) is an open collection of 142,220 annotated human motion captures designed for humanoid robotics research. The dataset provides each motion in three sk...
FinePhrase: 1.35B Synthetic Samples for FAQ, Math, Tables & Tutorials
HuggingFaceFW/finephrase
FinePhrase is a massive synthetic dataset created by DataTrove using the SmolLM2-1.7B-Instruct model. It re‑writes source documents from the FineWeb‑Edu corpus into four distinct instructional formats...
NVIDIA Nemotron‑3 Super 120B FP8: Massive Context & Agentic Reasoning Model
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
NVIDIA's Nemotron‑3 Super 120B‑A12B‑FP8 is a 120‑billion‑parameter large language model (with 12 B active parameters) released on March 11, 2026. Built on the Transformers library and tagged for text‑...
Open-RL: Verifiable STEM Reasoning Dataset for Outcome‑Supervised RL
TuringEnterprises/Open-RL
The **Open-RL** dataset, released by **TuringEnterprises** on March 2, 2026, offers a compact collection (<1K entries) of self‑contained, verifiable STEM reasoning problems spanning physics, mathemati...
Uncensored Power: Qwen3.5-9B Aggressive Model Goes Multimodal
HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive
The Qwen3.5-9B-Uncensored-HauhauCS-Aggressive model is a 9 billion‑parameter language model released by HauhauCS that removes all refusal filters from the original Qwen3.5‑9B architecture. According t...
DyNativeGaussian_sequence: A Multimodal Text‑3D Dataset Gains Traction
LeeXiangNO1/DyNativeGaussian_sequence
The dataset "LeeXiangNO1/DyNativeGaussian_sequence" is a recently popular multimodal collection authored by LeeXiangNO1. It contains both textual and 3D data, as indicated by the tags "modality:text" ...
ALL Bench Leaderboard 2026: The First Unified Multi‑Modal AI Benchmark
FINAL-Bench/ALL-Bench-Leaderboard
The **ALL Bench Leaderboard 2026** dataset, curated by FINAL‑Bench, aggregates benchmark scores for more than 90 AI models across six modalities—LLMs, VLMs, autonomous agents, image generation, video ...
Unsloth’s GGUF‑Quantized Qwen3.5‑35B‑A3B: Vision‑Language Power on a Laptop
unsloth/Qwen3.5-35B-A3B-GGUF
The **unsloth/Qwen3.5-35B-A3B-GGUF** repository provides a GGUF‑quantized checkpoint of the Qwen3.5‑35B‑A3B model, repackaged by the Unsloth community. With a **pipeline tag of `image-text-to-text`**...
Unsloth Qwen3.5-9B GGUF Model – Trending Overview
unsloth/Qwen3.5-9B-GGUF
The unsloth/Qwen3.5-9B-GGUF model is a 9‑billion‑parameter multimodal (vision‑language) foundation model quantized to the GGUF format using Unsloth Dynamic 2.0, offering superior accuracy and low‑late...
Phi-4 Reasoning Vision 15B: Multimodal AI with Chain‑of‑Thought Power
microsoft/Phi-4-reasoning-vision-15B
Phi-4-Reasoning-Vision-15B is an open‑weight multimodal model released by Microsoft on March 4, 2026. It combines the Phi‑4‑Reasoning language backbone (5 B–15 B parameters) with a SigLIP‑2 vision enc...
Unsloth Qwen3.5-4B GGUF – Trending Multimodal LLM (2026-03-08)
unsloth/Qwen3.5-4B-GGUF
The unsloth‑quantized Qwen3.5‑4B GGUF model is a 4‑billion‑parameter causal language model with an integrated vision encoder. It supports a native context length of 262K tokens (extendable to >1M) and...
The Stack v2: Massive Multilingual Code Corpus for AI
bigcode/the-stack-v2
The Stack v2, released by the BigCode team, is a gargantuan dataset of source code harvested from over 600 programming languages. Tagged for text-generation tasks, it provides raw code files along wit...
Qwen3.5‑397B‑A17B: Ultra‑Large Multimodal Model Redefines Vision‑Language AI
Qwen/Qwen3.5-397B-A17B
Qwen3.5‑397B‑A17B is a next‑generation multimodal language model released by the Alibaba‑Qwen team. It is an image‑to‑text (image‑text‑to‑text) model built on a causal decoder architecture and equipp...
LocoOperator-4B: A 4B‑Parameter Local Code‑Explorer Agent
LocoreMind/LocoOperator-4B
LocoOperator-4B is a 4 billion‑parameter tool‑calling agent released by LocoreMind. It is built on the Qwen3‑4B‑Instruct‑2507 base model and distilled from the Qwen3‑Coder‑Next teacher using full‑para...
Qwen3.5-122B-A10B – A 122‑Billion‑Parameter Sparse Mixture‑of‑Experts Vision‑Language Model
Qwen/Qwen3.5-122B-A10B
Qwen3.5-122B-A10B is a 122‑billion‑parameter causal language model with a vision encoder that activates only ~10 B parameters per inference via a 256‑expert Mixture‑of‑Experts architecture (8 routed +...
CCN Dataset: Tabular Classification for Advanced Route Recommendation
GD-ML/CCN
The GD-ML/CCN dataset, released by the GD-ML team, supports the research paper *Towards Full Candidate Interaction: A Comprehensive Comparison Network for Better Route Recommendation*. It is a tabular...
Real Slop: 155k Real LLM Interactions for Dialogue & Safety Research
Solenopsisbot/real-slop
Real Slop is a Hugging Face dataset released by the user Solenopsisbot that aggregates 155,000 real‑world language model interactions in English. The entries span a variety of model families and are s...
GitHub Top Developer Source Code: 1.3M+ Files for Code Intelligence
ronantakizawa/github-top-code
The **GitHub Top Developer Source Code** dataset, authored by *ronantakizawa*, aggregates over **1.3 million source code files** contributed by the most highly ranked GitHub developers between 2015 an...
Coding Agent Conversations: 549 Sessions of AI Tool Use
peteromallet/dataclaw-peteromallet
The *Coding Agent Conversations* dataset (ID: `peteromallet/dataclaw-peteromallet`) is a collection of 549 logged sessions where large language models act as coding assistants. Each session records me...
OpenResearcher Dataset: Structured LLM Interaction Traces for Tool‑Use Research
OpenResearcher/OpenResearcher-Dataset
The OpenResearcher/OpenResearcher-Dataset is a curated collection of 6,102 multi‑turn conversational examples, each tied to a unique question (qid), a reference answer, and a detailed message log. The...
Common Corpus: 2.3 T Token Open Multilingual Text Dataset
PleIAs/common_corpus
Common Corpus, released by PleIAs and a network of partners, is currently the largest openly licensed text collection, containing 2.27 trillion tokens across more than a dozen languages. The dataset a...
ToolMind-Web-QA: Synthetic Multi‑Hop Web‑Search QA for Long‑Horizon Agents
Nanbeige/ToolMind-Web-QA
ToolMind-Web-QA is a publicly released, synthetic dataset created by Nanbeige for research on search‑augmented and long‑horizon search agents. It contains roughly 6,000 complex question‑answer pairs g...
DeepGen 1.0 Image Dataset: Small Yet Powerful Multimodal Training Resource
deepgenteam/DeepGen-1.0
The **deepgenteam/DeepGen-1.0** dataset is a lightweight image collection released by the DeepGen team. Hosted on Hugging Face, it follows the *imagefolder* format, is licensed under Apache‑2.0, and f...
Kimi K2.5: 1‑Trillion‑Parameter Multimodal Agent for Vision‑Language Reasoning
moonshotai/Kimi-K2.5
Kimi K2.5, released by Moonshot AI, is an open‑source, native multimodal model that bridges vision and language through a 1‑trillion‑parameter Mixture‑of‑Experts architecture. Built on top of the Kimi...
MolmoSpaces: A Rich Asset Hub for Robotics and Embodied AI
allenai/molmospaces
MolmoSpaces is a dataset released by the Allen Institute for AI (AI2) that bundles asset data for the MolmoSpaces project. It provides a comprehensive collection of 3‑D objects, robot models, scene de...
MiniMax-M2.5 Model Overview and Insights
MiniMaxAI/MiniMax-M2.5
MiniMax-M2.5 is the latest frontier model from MiniMax AI, excelling in coding, agentic tool use, search, and office work. Trained with reinforcement learning across hundreds of thousands of real-worl...
Fine-T2I: 6M High‑Quality Text‑Image Pairs for Open T2I Fine‑Tuning
ma-xu/fine-t2i
Fine‑T2I is a large‑scale, open dataset released by Xu Ma, Yitian Zhang, Qihua Dong, and Yun Fu from Northeastern University. It contains over 6.15 million text–image pairs (about 2 TB) organized in W...
Qwen3‑TTS 1.7B CustomVoice: Real‑Time Multilingual Speech with Instruction‑Driven Style
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice is a 1.7 B‑parameter text‑to‑speech model released by the Qwen team. It supports ten major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portu...
Chinese-Fineweb-Edu-V2.2 Dataset: Quickstart Guide
opencsg/Fineweb-Edu-Chinese-V2.2
The Chinese-Fineweb-Edu-V2.2 dataset provides high‑quality Chinese educational text for large‑language‑model pre‑training and instruction fine‑tuning. It is organized into three tiers of pre‑training ...
Kitchen Robotics: 600 Hours of Human Tele‑Operated Demonstrations
nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos
PhysicalAI‑Robotics‑Kitchen‑Sim‑Demos is a large‑scale dataset released by NVIDIA that captures 600 hours of human‑teleoperated manipulation in a simulated kitchen environment. The data spans 316 dist...
GLM-5: 744B LLM with Sparse Attention, Tool Use, and Long‑Context Capabilities
zai-org/GLM-5
GLM-5, released by the ZAI organization, is a massive multilingual language model targeting complex systems engineering and long‑horizon agentic tasks. It scales up to 744 B parameters (with 40 B acti...
GLM-OCR: Multilingual, High‑Performance OCR for Complex Documents
zai-org/GLM-OCR
GLM-OCR is a multimodal OCR model built on the GLM‑V encoder‑decoder architecture, integrating the CogViT visual encoder and a lightweight cross‑modal connector with a GLM‑0.5B language decoder. It op...
Exploring the Massive International Travel Text Dataset
GD-ML/IntTravel_dataset
The GD-ML/IntTravel_dataset is a large‑scale text collection hosted on the Hugging Face Hub. According to its metadata, the dataset falls in the 100 M < size < 1 B range, is stored in CSV format, and ...
Moonworks Lunara Aesthetic II: High‑Quality Image Variation Dataset
moonworks/lunara-aesthetic-image-variations
The **Moonworks Lunara Aesthetic II** dataset, released by the creator *moonworks*, provides 2,854 paired images designed for research on image editing, image‑to‑image generation, and identity preserv...
DeepPlanning: Benchmarking Long‑Horizon Agentic Planning with Constraints
Qwen/DeepPlanning
DeepPlanning is a newly released dataset from Qwen that serves as a benchmark for evaluating the long‑horizon planning abilities of large language models (LLMs). It focuses on agentic tasks where mod...
Intern‑S1‑Pro: Trillion‑Scale Multimodal Scientific Reasoner Takes the Lead
internlm/Intern-S1-Pro
Intern‑S1‑Pro, released by the InternLM team, is a trillion‑parameter mixture‑of‑experts (MoE) foundation model that targets scientific multimodal reasoning. Tagged with **image‑text‑to‑text**, it acc...
Moltbook Annotated Posts & Submolts: A Rich Resource for Content Classification
TrustAIRLab/Moltbook
The Moltbook Dataset, released by TrustAIRLab, provides over 44,000 GPT‑5.2‑annotated posts and 12,209 submolts harvested from the agent social network Moltbook. Each post is labeled with one of nine ...
NVIDIA Personaplex 7B: Audio‑to‑Audio Model Gains Traction
nvidia/personaplex-7b-v1
The model nvidia/personaplex-7b-v1 is an audio‑to‑audio (speech‑to‑speech) model hosted on Hugging Face. It is built with the Moshi library and distributed in the safetensors format, indicating a focu...
TeichAI's Small Text Dataset for Claude‑4.5 Opus Reasoning Gains Traction
TeichAI/claude-4.5-opus-high-reasoning-250x
The dataset **TeichAI/claude-4.5-opus-high-reasoning-250x** is a compact collection of text entries (size category n<1K) stored in JSON format. Created by the user *TeichAI* on November 27, 2025, it h...
Qwen3-Coder-Next: The Trending Code‑Focused Text Generation Model
Qwen/Qwen3-Coder-Next
Qwen/Qwen3-Coder-Next is a newly released transformer model that targets text‑generation tasks, as indicated by its pipeline_tag. The model’s identifier – “Coder‑Next” – suggests a focus on programmin...
Tencent's CL-bench: A New Benchmark for Long-Context Text Generation
tencent/CL-bench
The CL-bench dataset, released by Tencent, is a recently trending English-language benchmark designed for text‑generation tasks that require handling long contexts. It contains between 1,000 and 10,00...