dataset June 29, 2026

HIW-500: 500 Hours of Humanoid Robot Learning in Real Homes

BitRobot/HIW-500

HIW-500 (Humanoids In-the-Wild) is a large‑scale dataset released by BitRobot that captures whole‑body teleoperation of the Unitree G1 humanoid robot across real household environments in Southeast As...

model June 28, 2026

Ornith‑1.0‑35B: Open‑Source Agentic Coding Model Shines on Benchmarks

deepreinforce-ai/Ornith-1.0-35B

Ornith-1.0-35B is a 35‑billion‑parameter mixture‑of‑experts (MoE) language model released by the Deep Reinforce team. Built on top of Qwen 3.5 and Gemma 4, it belongs to the Ornith-1.0 family of *self...

dataset June 27, 2026

Unlocking arXiv’s LaTeX Treasure: A Monthly Parquet Dataset for Researchers

scholarweave/arxiv-latex

The **arXiv LaTeX Source Dataset** (ID: `scholarweave/arxiv-latex`) offers the complete, pre‑parsed LaTeX source files of every arXiv paper, aligned with the official metadata and stored in ready‑to‑q...

dataset June 26, 2026

AgentWorldBench: Benchmarking Language World Models Across Real‑World Environments

Qwen/AgentWorldBench

AgentWorldBench, released by Qwen, is a curated evaluation benchmark for language world models that simulate environments such as APIs, search engines, terminals, IDEs, Android UIs, web browsers, and ...

dataset June 25, 2026

ITBench-AA: SRE Incident Scenarios for Kubernetes Root‑Cause Analysis

ArtificialAnalysis/ITBench-AA

The **ITBench-AA** dataset, released by ArtificialAnalysis, is a curated subset of IBM's ITBench benchmark focused on Site Reliability Engineering (SRE) scenarios. It contains 40 public Kubernetes inc...

model June 24, 2026

Qwythos-9B Claude Mythos GGUF: 1M‑token, multimodal, uncensored reasoning model

empero-ai/Qwythos-9B-Claude-Mythos-5-1M-GGUF

The **Qwythos-9B-Claude-Mythos-5-1M-GGUF** model, released by Empero AI, is a GGUF‑quantized version of the base Qwythos‑9B Claude Mythos model. It is built on the Qwen3.5‑9B architecture and has been...

dataset June 23, 2026

Complete FABLE.5 Traces 2M: A Massive Deduped Agentic Trace Corpus

Glint-Research/Complete-FABLE.5-traces-2M

The Complete FABLE.5 Traces 2M dataset, curated by Glint-Research, aggregates over 2 million cleaned rows of agentic interaction traces from the FABLE.5 / Mythos family of corpora. After a post‑closur...

dataset June 22, 2026

Inside the Claude Code Session that Built a 3D Boeing 747

victor/fable-5-boeing-747-trace

The **victor/fable-5-boeing-747-trace** dataset is a compact (under 1 KB) JSONL collection that records an entire Claude Code (Fable 5) agent session. Authored by "victor" and released under an MIT li...

model June 21, 2026

Jackrong Qwopus‑3.6‑27B‑Coder: 27‑B Agentic Coding Model (GGUF)

Jackrong/Qwopus3.6-27B-Coder-MTP-GGUF

Qwopus‑3.6‑27B‑Coder is a 27‑billion‑parameter LLM released in GGUF format for efficient inference with llama.cpp and llama‑cpp‑compatible runtimes. Billed as a “Coder SFT” release, it builds on the r...

model June 20, 2026

GLM-5.2: 1‑Million‑Token LLM Unleashed for Multilingual Long‑Context Generation

zai-org/GLM-5.2

GLM-5.2, the latest flagship model from the GLM‑5 team, is a text‑generation LLM built on the Transformers library and released under an MIT license. It supports both English and Chinese and is design...

model June 19, 2026

Kimi K2.7 Code GGUF: Multimodal Coding Agent Takes the Lead

unsloth/Kimi-K2.7-Code-GGUF

The **unsloth/Kimi-K2.7-Code-GGUF** model is a quantized (GGUF) version of Moonshot AI's Kimi K2.7 Code, released by the Unsloth community. It is classified under the `image-text-to-text` pipeline and...

model June 18, 2026

VibeThinker-3B: Small‑Scale Reasoning Powerhouse for Math, Code, and STEM

WeiboAI/VibeThinker-3B

VibeThinker-3B, released by WeiboAI, is a 3‑billion‑parameter language model built on top of Qwen/Qwen2.5-Coder-3B. It targets verifiable reasoning tasks such as mathematics, competitive programming, ...

model June 17, 2026

DiffusionGemma 26B GGUF: Fast Multimodal Generation on Your GPU

unsloth/diffusiongemma-26B-A4B-it-GGUF

DiffusionGemma-26B-A4B-it is a 26‑billion‑parameter multimodal model from Google DeepMind that generates text from interleaved text, image, and video inputs. The unsloth repository provides GGUF‑quant...

dataset June 16, 2026

KSAFE-MM: Korean Multimodal Safety Benchmark Gains Traction

K-intelligence/KSAFE-MM

KSAFE-MM, released by K‑intelligence, is a Korean multimodal safety benchmark designed for academic research on AI safety. The dataset comprises 14,135 query‑image pairs split into two subsets: *KSAFE...

model June 15, 2026

Gemma4‑12B‑Coder GGUF: Tiny Local Python Coding Assistant

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

Gemma4-12B-Coder (GGUF) is a community‑fine‑tuned version of Google’s Gemma‑4 12B IT model, focused on verifiable Python code generation. The model was distilled from two chain‑of‑thought (CoT) source...

model June 14, 2026

North Mini Code 1.0 – 30B Agentic Code Generator with Tool Use

CohereLabs/North-Mini-Code-1.0

North Mini Code 1.0 is a 30 B‑parameter (3 B active) decoder‑only Transformer released by Cohere Labs. The model is built as a sparse Mixture‑of‑Experts (128 experts, 8 active per token) and features ...

model June 13, 2026

Unsloth’s GGUF‑Quantized Gemma‑4 12B: Multimodal Reasoning on Your Device

unsloth/gemma-4-12b-it-GGUF

The *unsloth/gemma-4-12b-it-GGUF* model is a 4‑bit GGUF quantized version of Google DeepMind's Gemma‑4 12B instruction‑tuned transformer. Built on the open‑source Gemma‑4 family, it inherits multimoda...

model June 12, 2026

Unsloth Gemma‑4 12B QAT GGUF: Fast, Multimodal Any‑to‑Any Model

unsloth/gemma-4-12B-it-qat-GGUF

The **unsloth/gemma-4-12B-it-qat-GGUF** model is a Quantization‑Aware Training (QAT) checkpoint of Google DeepMind's Gemma‑4 12B instruction‑tuned model, packaged by Unsloth in the GGUF format. It ret...

model June 11, 2026

DiffusionGemma 26B: Fast Multimodal Text Generation with Vision and Reasoning

google/diffusiongemma-26B-A4B-it

DiffusionGemma-26B-A4B-it is an open‑weights multimodal model from Google DeepMind, built on the 26‑billion‑parameter A4B Mixture‑of‑Experts (MoE) Gemma 4 architecture. It follows a novel discrete dif...

dataset June 10, 2026

MR‑RATE: Massive 3D MRI‑Report Dataset Fuels Multimodal Medical AI

Forithmus/MR-RATE

The MR‑RATE dataset, released by Forithmus in partnership with the University of Zurich and NVIDIA, provides a comprehensive collection of 705,254 non‑contrast and contrast‑enhanced brain and spine MR...

dataset June 09, 2026

Ultra-FineWeb-L3: Massive Bilingual Synthetic QA & Style‑Rich Corpus for LLM Pre‑training

openbmb/Ultra-FineWeb-L3

Ultra-FineWeb-L3, released by the openbmb team, is the L3 tier of the UltraData tiered data management framework. It refines the trillion‑token Ultra-FineWeb web corpus through two synthesis steps—Q&A...

model June 08, 2026

Ideogram 4 (nf4): Open‑Weight 9.3B Text‑to‑Image Diffusion Model with JSON Prompting

ideogram-ai/ideogram-4-nf4

Ideogram 4 (nf4) is Ideogram AI's first open‑weight text‑to‑image model, released in June 2026. Built from scratch with 9.3 B parameters, it uses a fully single‑stream Diffusion Transformer (DiT) arch...

model June 07, 2026

Nemotron 3.5 ASR: 600M‑parameter Multilingual Streaming Speech‑to‑Text

nvidia/nemotron-3.5-asr-streaming-0.6b

NVIDIA’s Nemotron 3.5 ASR (model ID nvidia/nemotron-3.5-asr-streaming-0.6b) is a 600 million‑parameter, cache‑aware FastConformer‑RNNT model built with the NeMo framework. It delivers low‑latency, str...

dataset June 06, 2026

GLM‑5.1‑Reasoning‑1M‑Cleaned: A Curated Corpus for Chain‑of‑Thought Fine‑Tuning

Jackrong/GLM-5.1-Reasoning-1M-Cleaned

The **GLM‑5.1‑Reasoning‑1M‑Cleaned** dataset, released by user **Jackrong**, is a cleaned and reformatted derivative of the original *Kassadin88/GLM-5.1-1000000x* collection. It contains 746,321 high‑...

model June 05, 2026

NVIDIA Qwen3.6-35B-A3B-NVFP4: FP4‑Quantized MoE Model for Fast Multimodal Text Generation

nvidia/Qwen3.6-35B-A3B-NVFP4

The NVIDIA Qwen3.6-35B-A3B-NVFP4 model is a 4‑bit (FP4) quantized version of Alibaba's Qwen3.6-35B-A3B foundation model, produced with NVIDIA's Model Optimizer. It retains the original 35 billion‑para...

dataset June 04, 2026

UltraData‑SFT‑2605: 15M+ High‑Quality SFT Samples for MiniCPM5‑1B and Beyond

openbmb/UltraData-SFT-2605

UltraData‑SFT‑2605, released by OpenBMB, is the full core‑domain supervised‑fine‑tuning (SFT) dataset that powered the post‑training of MiniCPM5‑1B‑SFT, the first 1‑billion‑parameter model in the Mini...

dataset June 03, 2026

Claude‑Opus 4.6 Trace Inversion: 9K High‑Fidelity Reasoning Traces

Jackrong/Claude-opus-4.6-TraceInversion-9000x

Claude‑Opus‑4.6‑TraceInversion‑9000x is a synthetic, multilingual chain‑of‑thought (CoT) dataset created by Jackrong using the novel Trace Inversion technique. The dataset contains 9,000 gzipped JSON‑...

model June 02, 2026

Marlin‑2B: Tiny Video VLM for Structured Captioning & Temporal Grounding

NemoStation/Marlin-2B

Marlin‑2B is an open‑source 2‑billion‑parameter video‑capable Vision‑Language Model (VLM) built on top of Qwen3.5‑2B. Developed by the NemoStation team, it adds two developer‑friendly modes—`caption`...

model June 01, 2026

HRM-Text-1B: Hierarchical Reasoning Model for Efficient Text Generation

sapientinc/HRM-Text-1B

HRM-Text-1B is a 1 billion‑parameter language model released by Sapient Intelligence. It implements the Hierarchical Reasoning Model (HRM) architecture, a dual‑timescale recurrent design where a high‑...

model May 31, 2026

Hy‑MT2‑30B‑A3B: Tencent's Massive Multilingual Translation Engine Hits the Spotlight

tencent/Hy-MT2-30B-A3B

Tencent's Hy‑MT2‑30B‑A3B is the flagship member of the Hy‑MT2 family, a set of "fast‑thinking" multilingual translation models released in May 2026. Built on the Transformers library and released as a...

model May 30, 2026

MiniCPM5‑1B: Open‑Source 1B‑Parameter LLM for Edge AI and Tool‑Calling

openbmb/MiniCPM5-1B

MiniCPM5-1B, released by the OpenBMB team, is a dense 1.08 B‑parameter causal language model built on the standard LlamaForCausalLM architecture. It targets on‑device and resource‑constrained deployme...

dataset May 29, 2026

MONET: 105M Enriched Image‑Text Pairs for Text‑to‑Image Pre‑training

jasperai/monet

The MONET (Massive, Open, Non‑redundant and Enriched Text‑to‑image) dataset provides 104.9 million high‑quality image‑text pairs curated from 2.9 billion raw pairs across nine open sources. It combine...

model May 28, 2026

Uncensored Multimodal Power: Qwen3.6-35B-A3B Aggressive GGUF Model Takes the Lead

HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

The Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive model is a 35‑billion‑parameter Mixture‑of‑Experts (MoE) transformer that retains the full capabilities of the original Qwen3.6‑35B‑A3B while removin...

model May 27, 2026

NuExtract3: 4B Vision‑Language Model for Structured Document Extraction & Markdown OCR

numind/NuExtract3

NuExtract3 is a 4 billion‑parameter vision‑language model built on Qwen/Qwen3.5‑4B and released by Numind. Tagged for image‑to‑text, document‑understanding, OCR, and structured information extraction...

model May 26, 2026

Hy-MT2-1.8B: Fast‑Thinking Multilingual Translation Model Hits the Spotlight

tencent/Hy-MT2-1.8B

Tencent's Hy‑MT2‑1.8B is a 1.8 billion‑parameter multilingual translation model released on Hugging Face in May 2026. Built with the Transformers library and distributed as safetensors, it supports in...

dataset May 25, 2026

Wikimedia Structured Wikipedia: Ready‑to‑Query Multilingual Knowledge in Parquet

wikimedia/structured-wikipedia

The Wikimedia Structured Wikipedia dataset provides pre‑parsed English and French Wikipedia articles in a unified Parquet format, totalling 44.42 GiB and over 10 million rows. Built by the Wikimedia E...

model May 24, 2026

Command A+ 05‑2026: Multilingual Vision‑Enabled Chatbot with Tool Use

CohereLabs/command-a-plus-05-2026-bf16

Command A+ is an open‑source, 25 billion‑parameter decoder‑only Sparse Mixture‑of‑Experts model released by Cohere and Cohere Labs. It supports both text and image inputs (pipeline tag **image-text-to...

dataset May 23, 2026

Explore the Dreamcore Aesthetic with 1K AI‑Generated Images

LukaDev13/Liminal-Dreamcore-1K

The **LukaDev13/Liminal-Dreamcore-1K** dataset is a curated collection of 1,000 AI‑generated images that embody the "dreamcore" aesthetic. Created by LukaDev13 and released under the MIT license, the ...

dataset May 22, 2026

AgentTrove Dataset Overview

open-thoughts/AgentTrove

AgentTrove (open-thoughts/AgentTrove) is the largest publicly available collection of agentic interaction traces, comprising 1,696,847 rows from 219 source datasets covering code repair, shell scripti...

model May 21, 2026

Ring-2.6-1T: Trillion‑Parameter Reasoning Model for Agent‑Driven Workflows

inclusionAI/Ring-2.6-1T

Ring-2.6-1T, released by inclusionAI, is a trillion‑parameter text‑generation model built on the Transformers library and distributed as safetensors. It is positioned as a flagship reasoning model fo...

model May 19, 2026

Fast Multimodal Qwen3.6 MTP GGUF: Image‑Video Chat, Tool‑Calling & Agentic Apps

unsloth/Qwen3.6-27B-MTP-GGUF

The **unsloth/Qwen3.6-27B-MTP-GGUF** model is a GGUF‑quantized version of Qwen3.6‑27B, released by the Unsloth community. It targets the *image-text-to-text* pipeline, meaning it can ingest images, vi...

dataset May 18, 2026

Jagle: A 9M‑Instance Japanese Vision‑Language Dataset for VQA and Multimodal Training

llm-jp/Jagle

Jagle is a large‑scale Japanese multimodal post‑training dataset released by the llm‑jp team. It contains roughly 9.2 million image‑text instances covering a wide range of tasks, most notably visual q...

model May 17, 2026

Sulphur‑2 Base: Open‑Source Text‑to‑Video Generator Takes the Spotlight

SulphurAI/Sulphur-2-base

SulphurAI's **Sulphur‑2‑base** is an uncensored video generation model built on the LTX 2.3 architecture. It supports both text‑to‑video (t2v) and image‑to‑video (i2v) generation natively, and can han...

model May 16, 2026

DeepSeek-V4-Pro: 1M‑Token MoE Model Redefines Long‑Context Generation

deepseek-ai/DeepSeek-V4-Pro

DeepSeek-V4-Pro is the flagship preview model of the DeepSeek‑V4 series, released by DeepSeek‑AI in April 2026. It is a Mixture‑of‑Experts (MoE) language model with 1.6 trillion total parameters, of w...

dataset May 15, 2026

DeepSeek V4 Pro Hermes Reasoning Traces: LoRA Fine‑Tuning Dataset

r0b0tlab/deepseek-hermes-reasoning-traces

The *deepseek-hermes-reasoning-traces* dataset, published by r0b0tlab, contains 19,331 multi‑turn ChatML conversations enriched with Hermes‑style reasoning and tool‑calling annotations. Generated by t...

dataset May 14, 2026

Open-MM-RL: Multimodal STEM Reasoning Benchmark for RL and QA

TuringEnterprises/Open-MM-RL

Open-MM-RL, released by TuringEnterprises, is a small but highly focused multimodal dataset for STEM question answering. It contains 40 training examples (with 3,000 more planned) that pair textual qu...

dataset May 13, 2026

FineWeb‑Edu: 1.3 T Tokens of Curated Educational Web Data

HuggingFaceFW/fineweb-edu

FineWeb‑Edu is a massive English‑language dataset of 1.3 trillion tokens drawn from the public CommonCrawl web, filtered to retain only high‑quality educational content. The dataset was created by Hug...

dataset May 11, 2026

Zero-to-CAD 1M: A Million Synthetic, Executable CAD Programs for AI Generation

ADSKAILab/Zero-To-CAD-1m

Zero-to-CAD 1M, released by Autodesk Research (ADSKAILab), is a synthetic dataset of 999,633 executable CadQuery Python scripts that construct parametric 3‑D models. Each entry includes the source cod...

dataset May 10, 2026

Claude Opus 4.6/4.7 Reasoning Dataset: 8.7K Synthetic CoT Examples Across 28 Domains

angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k

The *Claude Opus 4.6/4.7 Reasoning Dataset* is a synthetic instruction‑tuning collection created entirely by Claude Opus models (versions 4.6 and 4.7). It contains 8,706 OpenAI‑style chat examples in ...

model May 09, 2026

Qwen3.6‑27B Heretic‑Uncensored Finetune (NEO‑CODE‑Di‑IMatrix‑MAX) – GGUF Quantized Model Report

DavidAU/Qwen3.6-27B-Heretic-Uncensored-FINETUNE-NEO-CODE-Di-IMatrix-MAX-GGUF

The Qwen3.6‑27B Heretic‑Uncensored model is a post‑trained, 27‑billion‑parameter causal language model with a vision encoder that has been stripped of safety‑filtering (“heretic”) and further finetune...

model May 07, 2026

Gemma‑4 31B‑IT Assistant: Draft Model for Fast Multimodal Generation

google/gemma-4-31B-it-assistant

The **google/gemma-4-31B-it-assistant** model is the Multi‑Token Prediction (MTP) drafter for the 31‑billion‑parameter Gemma‑4 instruction‑tuned model. It extends the base Gemma‑4 model with a smaller...

model May 06, 2026

DeepSeek‑V4‑Flash: 284B MoE Model with 1‑Million‑Token Context

deepseek-ai/DeepSeek-V4-Flash

DeepSeek‑V4‑Flash, released by DeepSeek‑AI, is a Mixture‑of‑Experts (MoE) language model with 284 B total parameters but only 13 B activated during inference. Built on the Transformers library and dis...

dataset May 05, 2026

TaskTrove: 750K Agentic Tasks for RL & SFT Evaluation

open-thoughts/TaskTrove

TaskTrove is an open‑source collection of more than 750,000 unique agentic tasks gathered from over 100 public sources. Released by the OpenThoughts‑Agent team, the dataset aggregates popular reinforc...

model May 04, 2026

talkie-1930-13b-it: Vintage 13B Model Tuned for Pre‑1931 English Instructions

talkie-lm/talkie-1930-13b-it

talkie-1930-13b-it is a 13‑billion‑parameter "vintage" language model that builds on the talkie-1930-13b-base checkpoint. The base model was trained on 260 B tokens drawn exclusively from English‑lang...

model May 03, 2026

MiMo-V2.5: Xiaomi’s Omnimodal 1M‑Token Agent Model Takes Center Stage

XiaomiMiMo/MiMo-V2.5

MiMo-V2.5, released by Xiaomi’s MiMo team, is a native omnimodal model that unifies text, image, video, and audio understanding within a single architecture. Built on the MiMo-V2‑Flash backbone, it em...

model May 02, 2026

Uncensored Qwen3.6‑27B Aggressive: Multimodal GGUF Model for Vision‑Language Tasks

HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive

The **Qwen3.6‑27B‑Uncensored‑HauhauCS‑Aggressive** model is a 27‑billion‑parameter, multilingual (English, Chinese, and other languages) vision‑language model built on the original Qwen/Qwen3.6‑27B ar...

dataset May 01, 2026

TAAC2026 Demo Recommendation Dataset (1K Samples) – A Quick Overview

TAAC2026/data_sample_1000

The TAIC2026 data_sample_1000 dataset is a small, 1,000‑row sample of user‑item interaction records released for the TAIC2026 competition. Stored as a flat‑column Parquet file (~39 MB) it contains 120...

dataset April 30, 2026

AtomBlock-WebUI: A Synthetic UI Detection Dataset Tailored for YOLO

ZhihaoNan/AtomBlock-WebUI

AtomBlock-WebUI is a synthetic dataset of roughly 9,700 full‑page web screenshots, each annotated with YOLO‑format bounding boxes for 14 UI element categories such as buttons, links, inputs, and struc...

dataset April 29, 2026

GSM8K: 8K Grade‑School Math Problems Powering LLM Reasoning

openai/gsm8k

The GSM8K dataset, released by OpenAI, contains 8,473 English‑language grade‑school math word problems split into a training set of 7,473 examples and a test set of 1,319 examples. Each entry provides...

model April 28, 2026

DeepSeek-V4-Flash-Base: FP8‑Optimized Model in Safetensors Format

deepseek-ai/DeepSeek-V4-Flash-Base

The DeepSeek‑V4‑Flash‑Base model (deepseek-ai/DeepSeek-V4-Flash-Base) is a recently uploaded model that belongs to the DeepSeek V4 series. Its metadata highlights a few key technical attributes: it is...

dataset April 27, 2026

MathNet v0: Multilingual Olympiad Math Reasoning & Retrieval Dataset Gains Traction

ShadenA/MathNet

MathNet v0 is a large‑scale, multimodal dataset of Olympiad‑level mathematics problems released by ShadenA (MIT) and featured in ICLR 2026. It aggregates 30,676 expert‑authored problems from 47 countr...

model April 26, 2026

Qwen3.6‑27B‑FP8 – Fast, Vision‑Enabled, Open‑Weight LLM

Qwen/Qwen3.6-27B-FP8

Qwen3.6‑27B‑FP8 is a fine‑grained FP8‑quantized version of the 27‑billion‑parameter Qwen3.6 model, compatible with Transformers, vLLM, SGLang, KTransformers and Azure endpoints. It combines a causal l...

dataset April 25, 2026

Fenrir v2.1: 100K Defensive Cybersecurity Chat Triples for Safe LLM Fine‑Tuning

AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1

The Fenrir v2.1 dataset, authored by Alican Kiraz, offers 99,870 high‑quality *system / user / assistant* chat triples designed for instruction‑tuning of defensive‑security language models. All record...

dataset April 24, 2026

Claude Opus 4.6 Reasoning Traces Dataset Fuels Tiny Model Fine‑Tuning

Roman1111111/claude-opus-4.6-10000x

The **Roman1111111/claude-opus-4.6-10000x** dataset is a high‑fidelity reasoning collection generated by Anthropic's Claude Opus 4.6. Each entry is a JSONL record that pairs a challenging math or logi...

model April 23, 2026

VoxCPM2: 2B‑Parameter Multilingual Diffusion TTS with Voice Design & Cloning

openbmb/VoxCPM2

VoxCPM2, released by the OpenBMB VoxCPM team, is a tokenizer‑free diffusion autoregressive text‑to‑speech model that packs 2 billion parameters and supports 30 languages, including many Chinese dialec...

dataset April 22, 2026

ParseBench: Enterprise Document Parsing Benchmark Takes Center Stage

llamaindex/ParseBench

ParseBench is a new, officially‑released benchmark for evaluating document‑parsing systems on real‑world enterprise PDFs. Curated by the LlamaIndex team, the dataset contains roughly 2,000 human‑verif...

model April 21, 2026

Qwen3.6-35B-A3B Model – Highlights, Benchmarks, and Opportunities

Qwen/Qwen3.6-35B-A3B

Qwen3.6-35B-A3B is a 35‑billion‑parameter causal language model with a vision encoder, released as the first open‑weight variant of the Qwen3.6 series. It features a 262k native context window (extend...

model April 20, 2026

HY‑World 2.0: Open‑Source Multi‑Modal 3D World Generation & Reconstruction

tencent/HY-World-2.0

HY‑World 2.0 is a pioneering multi‑modal world model released by Tencent that bridges text, images, multi‑view photos, and video to realistic 3D scenes. It is advertised as the first open‑source, sta...

model April 19, 2026

MOSS‑TTS‑Nano: Tiny Multilingual Real‑Time TTS for CPU‑Only Apps

OpenMOSS-Team/MOSS-TTS-Nano-100M

MOSS‑TTS‑Nano is an open‑source multilingual text‑to‑speech model released by the OpenMOSS team and MOSI.AI. With only 0.1 B parameters, it targets real‑time speech generation on modest hardware – the...

dataset April 18, 2026

One Million Reasoning Traces: KIMI‑K2.5‑1000000x Dataset

ianncity/KIMI-K2.5-1000000x

The **KIMI‑K2.5‑1000000x** dataset, authored by *ianncity* and released in March 2026, contains one million distilled reasoning traces generated from the KIMI‑K2.5 model on high‑level reasoning tasks....

model April 17, 2026

⚡ Gemma‑4‑31B‑IT NVFP4 Turbo: 68% Smaller, 2.5× Faster Text Generation

LilaRest/gemma-4-31B-it-NVFP4-turbo

LilaRest’s *Gemma 4 31B IT NVFP4 Turbo* is a repackaged, quantized version of Google DeepMind’s Gemma‑4 31B‑IT model. Built on the NVIDIA NVFP4 checkpoint, it quantizes self‑attention weights to FP4 (...

model April 16, 2026

OmniVoice: 600‑Language Zero‑Shot TTS Takes Center Stage

k2-fsa/OmniVoice

OmniVoice is a massively multilingual zero‑shot text‑to‑speech (TTS) model that supports over 600 languages, making it the broadest‑coverage TTS system currently available. Built on a diffusion langua...

model April 15, 2026

GLM-5.1: Multilingual Agentic Coding Model Takes the Lead

zai-org/GLM-5.1

GLM-5.1, released by the ZAI organization, is the latest flagship large language model designed for agentic engineering. It is a multilingual text‑generation model (English and Chinese) built on the T...

dataset April 14, 2026

Hermes Agent Reasoning Traces: Real Tool-Calling Trajectories for AI Agent Training

lambda/hermes-agent-reasoning-traces

The **Hermes Agent Reasoning Traces** dataset, released by lambda, provides multi‑turn tool‑calling trajectories captured from two powerful LLMs: Moonshot AI's Kimi‑K2.5 and ZhipuAI's GLM‑5.1-FP8. Eac...

dataset April 13, 2026

Comprehensive Overview of the Multilingual Dataset

wikimedia/wikipedia

This dataset comprises a vast collection of text data spanning over 600 language configurations, each with its own training split. It includes a wide variety of linguistic resources, ranging from well...

dataset April 12, 2026

Vietnam Real Estate Listings 2025: 1M Records for Price Prediction & Market Insight

tinixai/vietnam-real-estates

The **Tinix Vietnam Real Estate Listings 2025** dataset, curated by TiniX AI, provides a comprehensive snapshot of the Vietnamese property market with exactly **1,000,000** listings collected between ...

dataset April 09, 2026

Redacted Coding Agent Session Traces from pi-mono – A New Dataset for Code Generation Research

badlogicgames/pi-mono

The *badlogicgames/pi-mono* dataset offers a collection of redacted coding‑agent session traces harvested from work on the open‑source *pi-mono* repository (https://github.com/badlogic/pi-mono.git). E...

model April 08, 2026

NVIDIA‑Optimized Gemma‑4 31B IT NVFP4: Fast Multimodal Text Generation

nvidia/Gemma-4-31B-IT-NVFP4

The **Gemma‑4 31B IT NVFP4** model is a quantized version of Google DeepMind's open‑source Gemma‑4 31B IT multimodal transformer. Built on 30.7 B parameters with a 256K‑token context window, it accept...

model April 07, 2026

Gemma 4 E2B‑IT: On‑Device Multimodal Reasoning Model Hits the Spotlight

google/gemma-4-E2B-it

Google DeepMind’s Gemma 4 E2B‑IT is the newest open‑weight multimodal model on Hugging Face, offering 2.3 B effective parameters (5.1 B total with embeddings) and a 128K token context window. Built on...

dataset April 06, 2026

FineWeb: 18.5 T Tokens of High‑Quality English Web Text Now Open

HuggingFaceFW/fineweb

The FineWeb dataset, released by HuggingFaceFW, provides over 18.5 trillion tokens of cleaned and deduplicated English web data sourced from CommonCrawl. Licensed under ODC‑By 1.0, it targets the text...

dataset April 05, 2026

UniSAFE: Multimodal Image‑Text Safety Dataset (Trending)

segyulee/UniSAFE

UniSAFE, authored by segyulee, is a multimodal dataset that pairs images with textual instructions and metadata describing unsafe triggers, target outcomes, and scenario types. It is stored in optimiz...

model April 04, 2026

Gemma‑4 26B A4B: Open‑Weight Multimodal MoE Model for Image‑Text Reasoning

google/gemma-4-26B-A4B-it

Google DeepMind’s Gemma‑4 family expands with the 26B A4B mixture‑of‑experts (MoE) model, released under an Apache‑2.0 license. Identified on Hugging Face as `google/gemma-4-26B-A4B-it` and tagged for...

model April 03, 2026

Gemma‑4 31B‑IT: A Multimodal Reasoning Powerhouse for Images, Video & Text

google/gemma-4-31B-it

The **google/gemma-4-31B-it** model is the instruction‑tuned, 31‑billion‑parameter dense variant of Google DeepMind's Gemma 4 family. Hosted on Hugging Face, it belongs to the *image‑text‑to‑text* pip...

model April 02, 2026

TRIBE v2: Multimodal Brain‑Encoding Model for fMRI Prediction

facebook/tribev2

TRIBE v2 is a foundation multimodal model released by Facebook Research that predicts functional MRI (fMRI) brain responses to naturalistic stimuli across vision, audition, and language. The model int...

dataset April 01, 2026

KIMI-K2.5-450000x: 450K High‑Quality Reasoning Traces for LLM Tuning

ianncity/KIMI-K2.5-450000x

The **KIMI-K2.5-450000x** dataset, authored by *ianncity*, contains 450,000 distilled reasoning traces generated from the KIMI‑K2.5 model under a "high" reasoning setting. With a total token count of ...

dataset March 31, 2026

Claude Opus 4.6 Extended Reasoning Dataset: Traces for Better LLM Reasoning

TeichAI/Claude-Opus-4.6-Reasoning-887x

The **Claude Opus 4.6 Extended Reasoning** dataset, created by TeichAI, is a small (<1 K records) JSON‑formatted collection of reasoning traces generated with Anthropic's Claude Opus 4.6. It aggregat...

model March 30, 2026

Efficient Multilingual Reasoning with Qwen3.5‑9B‑Claude‑Opus Distilled v2 (GGUF)

Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF

The **Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2** model is a second‑generation fine‑tune of the Qwen3.5‑9B base model, distilled from over 14,000 Claude 4.6 Opus‑style reasoning samples. Built...

dataset March 29, 2026

Claude‑Sonnet 4.6 Reasoning Dataset: 799 Deep Thought Conversations

TeichAI/Claude-Sonnet-4.6-Reasoning-799x

The **Claude‑Sonnet‑4.6‑Reasoning‑799x** dataset, authored by TeichAI, contains 799 single‑turn user→assistant exchanges that focus exclusively on chain‑of‑thought reasoning. Each response averages ar...

dataset March 28, 2026

Michael Hafftka Catalog Raisonné: 3.8K Paintings with Rich Metadata

Hafftka/michael-hafftka-catalog-raisonne

The **Michael Hafftka – Catalog Raisonné** dataset is a curated collection of roughly 3,800 digitized paintings by the American expressionist Michael Hafftka, spanning the period from the 1970s throug...

dataset March 27, 2026

AutoMathText-V2 Dataset Overview Report

OpenSQZ/AutoMathText-V2

AutoMathText-V2 is a curated collection of 52 premium data sources spanning web content, mathematics, code, reasoning, formal proofs, and bilingual translation. It aggregates over 1.5 trillion tokens,...

dataset March 26, 2026

Hacker News Dataset Report

open-index/hacker-news

The Hacker News Complete Archive mirrors every item posted on news.ycombinator.com from its inception in October 2006 through the present day, totaling over 47 million records. The data is stored in m...

dataset March 25, 2026

OmniAction: A Massive Omni‑modal Dataset for Proactive Robot Manipulation

OpenMOSS-Team/OmniAction

The OpenMOSS-Team has released **OmniAction**, a large‑scale multimodal dataset designed for contextual instruction following in robotic manipulation. Hosted on HuggingFace, the dataset contains 141,1...

model March 24, 2026

OmniCoder-9B: A 9B Coding Agent Fine‑Tuned on 425K Agentic Trajectories

Tesslate/OmniCoder-9B

OmniCoder-9B is a 9‑billion‑parameter coding agent released by Tesslate and built on top of Qwen3.5‑9B’s hybrid Gated‑Delta/standard‑attention architecture. It has been fine‑tuned with LoRA (r=64, alp...

model March 23, 2026

Foundation-1: Structured Text‑to‑Sample Music Generator Takes Center Stage

RoyalCities/Foundation-1

Foundation-1, released by RoyalCities, is a next‑generation text‑to‑sample model fine‑tuned from Stability AI’s stable‑audio‑open‑1.0. Designed for modern music production, it interprets layered promp...

dataset March 22, 2026

Screen‑Recording Dataset Powers Next‑Gen Desktop AI Agents

markov-ai/computer-use-large

The **Computer Use Large** dataset, released by *markov-ai*, contains 48,478 trimmed screen‑recording videos totalling roughly 12,300 hours of professional software usage. All videos are audio‑free an...

model March 21, 2026

Uncensored Multimodal Power: Qwen3.5-35B-A3B Aggressive Variant

HauhauCS/Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive

The **Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive** model is an uncensored, aggressive fork of the original Qwen3.5‑35B‑A3B released by the community contributor HauhauCS. With over 210 k downloads...

model March 20, 2026

Fish Audio S2 Pro – Multilingual TTS with Fine‑Grained Inline Control

fishaudio/s2-pro

Fish Audio S2 Pro is a state‑of‑the‑art text‑to‑speech (TTS) model released by the Fish Audio research team. It supports more than 80 languages, including tier‑1 coverage for English, Chinese, and Jap...

dataset March 19, 2026

olmOCR-bench: The New Standard for PDF‑to‑Markdown OCR Evaluation

allenai/olmOCR-bench

olmOCR-bench, released by AllenAI, is a benchmark dataset comprising 1,403 PDF files and 7,010 unit test cases that capture the properties a high‑quality OCR system should preserve when converting PDF...

dataset March 18, 2026

BONES-SEED: Massive Multimodal Motion Dataset for Humanoid Robotics

bones-studio/seed

BONES-SEED (Skeletal Everyday Embodiment Dataset) is an open collection of 142,220 annotated human motion captures designed for humanoid robotics research. The dataset provides each motion in three sk...

dataset March 17, 2026

FinePhrase: 1.35B Synthetic Samples for FAQ, Math, Tables & Tutorials

HuggingFaceFW/finephrase

FinePhrase is a massive synthetic dataset created by DataTrove using the SmolLM2-1.7B-Instruct model. It re‑writes source documents from the FineWeb‑Edu corpus into four distinct instructional formats...

model March 16, 2026

NVIDIA Nemotron‑3 Super 120B FP8: Massive Context & Agentic Reasoning Model

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8

NVIDIA's Nemotron‑3 Super 120B‑A12B‑FP8 is a 120‑billion‑parameter large language model (with 12 B active parameters) released on March 11, 2026. Built on the Transformers library and tagged for text‑...

dataset March 15, 2026

Open-RL: Verifiable STEM Reasoning Dataset for Outcome‑Supervised RL

TuringEnterprises/Open-RL

The **Open-RL** dataset, released by **TuringEnterprises** on March 2, 2026, offers a compact collection (<1K entries) of self‑contained, verifiable STEM reasoning problems spanning physics, mathemati...

model March 14, 2026

Uncensored Power: Qwen3.5-9B Aggressive Model Goes Multimodal

HauhauCS/Qwen3.5-9B-Uncensored-HauhauCS-Aggressive

The Qwen3.5-9B-Uncensored-HauhauCS-Aggressive model is a 9 billion‑parameter language model released by HauhauCS that removes all refusal filters from the original Qwen3.5‑9B architecture. According t...

dataset March 13, 2026

DyNativeGaussian_sequence: A Multimodal Text‑3D Dataset Gains Traction

LeeXiangNO1/DyNativeGaussian_sequence

The dataset "LeeXiangNO1/DyNativeGaussian_sequence" is a recently popular multimodal collection authored by LeeXiangNO1. It contains both textual and 3D data, as indicated by the tags "modality:text" ...

dataset March 12, 2026

ALL Bench Leaderboard 2026: The First Unified Multi‑Modal AI Benchmark

FINAL-Bench/ALL-Bench-Leaderboard

The **ALL Bench Leaderboard 2026** dataset, curated by FINAL‑Bench, aggregates benchmark scores for more than 90 AI models across six modalities—LLMs, VLMs, autonomous agents, image generation, video ...

model March 11, 2026

Unsloth’s GGUF‑Quantized Qwen3.5‑35B‑A3B: Vision‑Language Power on a Laptop

unsloth/Qwen3.5-35B-A3B-GGUF

The **unsloth/Qwen3.5-35B-A3B-GGUF** repository provides a GGUF‑quantized checkpoint of the Qwen3.5‑35B‑A3B model, repackaged by the Unsloth community. With a **pipeline tag of `image-text-to-text`**...

model March 10, 2026

UnsLoTh Qwen3.5-9B GGUF Model – Trending Overview

unsloth/Qwen3.5-9B-GGUF

The unsloth/Qwen3.5-9B-GGUF model is a 9‑billion‑parameter multimodal (vision‑language) foundation model quantized to the GGUF format using Unsloth Dynamic 2.0, offering superior accuracy and low‑late...

model March 09, 2026

Phi-4 Reasoning Vision 15B: Multimodal AI with Chain‑of‑Thought Power

microsoft/Phi-4-reasoning-vision-15B

Phi-4-Reasoning-Vision-15B is an open‑weight multimodal model released by Microsoft on March 4, 2026. It combines the Phi‑4‑Reasoning language backbone (5 B–15 B parameters) with a SigLIP‑2 vision enc...

model March 08, 2026

Unslo​th Qwen3.5-4B GGUF – Trending Multimodal LLM (2026-03-08)

unsloth/Qwen3.5-4B-GGUF

The unsloth‑quantized Qwen3.5‑4B GGUF model is a 4‑billion‑parameter causal language model with an integrated vision encoder. It supports a native context length of 262K tokens (extendable to >1M) and...

dataset March 07, 2026

The Stack v2: Massive Multilingual Code Corpus for AI

bigcode/the-stack-v2

The Stack v2, released by the BigCode team, is a gargantuan dataset of source code harvested from over 600 programming languages. Tagged for text-generation tasks, it provides raw code files along wit...

model March 06, 2026

Qwen3.5‑397B‑A17B: Ultra‑Large Multimodal Model Redefines Vision‑Language AI

Qwen/Qwen3.5-397B-A17B

Qwen3.5‑397B‑A17B is a next‑generation multimodal language model released by the Alibaba‑Qwen team. It is an image‑to‑text (image‑text‑to‑text) model built on a causal decoder architecture and equipp...

model March 05, 2026

LocoOperator-4B: A 4B‑Parameter Local Code‑Explorer Agent

LocoreMind/LocoOperator-4B

LocoOperator-4B is a 4 billion‑parameter tool‑calling agent released by LocoreMind. It is built on the Qwen3‑4B‑Instruct‑2507 base model and distilled from the Qwen3‑Coder‑Next teacher using full‑para...

model March 04, 2026

Qwen3.5-122B-A10B – A 122‑Billion‑Parameter Sparse Mixture‑of‑Experts Vision‑Language Model

Qwen/Qwen3.5-122B-A10B

Qwen3.5-122B-A10B is a 122‑billion‑parameter causal language model with a vision encoder that activates only ~10 B parameters per inference via a 256‑expert Mixture‑of‑Experts architecture (8 routed +...

dataset March 03, 2026

CCN Dataset: Tabular Classification for Advanced Route Recommendation

GD-ML/CCN

The GD-ML/CCN dataset, released by the GD-ML team, supports the research paper *Towards Full Candidate Interaction: A Comprehensive Comparison Network for Better Route Recommendation*. It is a tabular...

dataset March 02, 2026

Real Slop: 155k Real LLM Interactions for Dialogue & Safety Research

Solenopsisbot/real-slop

Real Slop is a Hugging Face dataset released by the user Solenopsisbot that aggregates 155,000 real‑world language model interactions in English. The entries span a variety of model families and are s...

dataset March 01, 2026

GitHub Top Developer Source Code: 1.3M+ Files for Code Intelligence

ronantakizawa/github-top-code

The **GitHub Top Developer Source Code** dataset, authored by *ronantakizawa*, aggregates over **1.3 million source code files** contributed by the most highly ranked GitHub developers between 2015 an...

dataset February 28, 2026

Coding Agent Conversations: 549 Sessions of AI Tool Use

peteromallet/dataclaw-peteromallet

The *Coding Agent Conversations* dataset (ID: `peteromallet/dataclaw-peteromallet`) is a collection of 549 logged sessions where large language models act as coding assistants. Each session records me...

dataset February 27, 2026

OpenResearcher Dataset: Structured LLM Interaction Traces for Tool‑Use Research

OpenResearcher/OpenResearcher-Dataset

The OpenResearcher/OpenResearcher-Dataset is a curated collection of 6,102 multi‑turn conversational examples, each tied to a unique question (qid), a reference answer, and a detailed message log. The...

dataset February 26, 2026

Common Corpus: 2.3 T Token Open Multilingual Text Dataset

PleIAs/common_corpus

Common Corpus, released by PleIAs and a network of partners, is currently the largest openly licensed text collection, containing 2.27 trillion tokens across more than a dozen languages. The dataset a...

dataset February 25, 2026

ToolMind-Web-QA: Synthetic Multi‑Hop Web‑Search QA for Long‑Horizon Agents

Nanbeige/ToolMind-Web-QA

ToolMind-Web-QA is a publicly released, synthetic dataset created by Nanbeige for research on search‑augmented and long‑horizon search agents. It contains roughly 6,000 complex question‑answer pairs g...

dataset February 24, 2026

DeepGen 1.0 Image Dataset: Small Yet Powerful Multimodal Training Resource

deepgenteam/DeepGen-1.0

The **deepgenteam/DeepGen-1.0** dataset is a lightweight image collection released by the DeepGen team. Hosted on Hugging Face, it follows the *imagefolder* format, is licensed under Apache‑2.0, and f...

model February 23, 2026

Kimi K2.5: 1‑Trillion‑Parameter Multimodal Agent for Vision‑Language Reasoning

moonshotai/Kimi-K2.5

Kimi K2.5, released by Moonshot AI, is an open‑source, native multimodal model that bridges vision and language through a 1‑trillion‑parameter Mixture‑of‑Experts architecture. Built on top of the Kimi...

dataset February 22, 2026

MolmoSpaces: A Rich Asset Hub for Robotics and Embodied AI

allenai/molmospaces

MolmoSpaces is a dataset released by the Allen Institute for AI (AI2) that bundles asset data for the MolmoSpaces project. It provides a comprehensive collection of 3‑D objects, robot models, scene de...

model February 21, 2026

MiniMax-M2.5 Model Overview and Insights

MiniMaxAI/MiniMax-M2.5

MiniMax-M2.5 is the latest frontier model from MiniMax AI, excelling in coding, agentic tool use, search, and office work. Trained with reinforcement learning across hundreds of thousands of real-worl...

dataset February 20, 2026

Fine-T2I: 6M High‑Quality Text‑Image Pairs for Open T2I Fine‑Tuning

ma-xu/fine-t2i

Fine‑T2I is a large‑scale, open dataset released by Xu Ma, Yitian Zhang, Qihua Dong, and Yun Fu from Northeastern University. It contains over 6.15 million text–image pairs (about 2 TB) organized in W...

model February 19, 2026

Qwen3‑TTS 1.7B CustomVoice: Real‑Time Multilingual Speech with Instruction‑Driven Style

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

Qwen3‑TTS‑12Hz‑1.7B‑CustomVoice is a 1.7 B‑parameter text‑to‑speech model released by the Qwen team. It supports ten major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portu...

dataset February 18, 2026

Chinese-Fineweb-Edu-V2.2 Dataset: Quickstart Guide

opencsg/Fineweb-Edu-Chinese-V2.2

The Chinese-Fineweb-Edu-V2.2 dataset provides high‑quality Chinese educational text for large‑language‑model pre‑training and instruction fine‑tuning. It is organized into three tiers of pre‑training ...

dataset February 17, 2026

Kitchen Robotics: 600 Hours of Human Tele‑Operated Demonstrations

nvidia/PhysicalAI-Robotics-Kitchen-Sim-Demos

PhysicalAI‑Robotics‑Kitchen‑Sim‑Demos is a large‑scale dataset released by NVIDIA that captures 600 hours of human‑teleoperated manipulation in a simulated kitchen environment. The data spans 316 dist...

model February 16, 2026

GLM-5: 744B LLM with Sparse Attention, Tool Use, and Long‑Context Capabilities

zai-org/GLM-5

GLM-5, released by the ZAI organization, is a massive multilingual language model targeting complex systems engineering and long‑horizon agentic tasks. It scales up to 744 B parameters (with 40 B acti...

model February 15, 2026

GLM-OCR: Multilingual, High‑Performance OCR for Complex Documents

zai-org/GLM-OCR

GLM-OCR is a multimodal OCR model built on the GLM‑V encoder‑decoder architecture, integrating the CogViT visual encoder and a lightweight cross‑modal connector with a GLM‑0.5B language decoder. It op...

dataset February 14, 2026

Exploring the Massive International Travel Text Dataset

GD-ML/IntTravel_dataset

The GD-ML/IntTravel_dataset is a large‑scale text collection hosted on the Hugging Face Hub. According to its metadata, the dataset falls in the 100 M < size < 1 B range, is stored in CSV format, and ...

dataset February 13, 2026

Moonworks Lunara Aesthetic II: High‑Quality Image Variation Dataset

moonworks/lunara-aesthetic-image-variations

The **Moonworks Lunara Aesthetic II** dataset, released by the creator *moonworks*, provides 2,854 paired images designed for research on image editing, image‑to‑image generation, and identity preserv...

dataset February 12, 2026

DeepPlanning: Benchmarking Long‑Horizon Agentic Planning with Constraints

Qwen/DeepPlanning

DeepPlanning is a newly released dataset from Qwen that serves as a benchmark for evaluating the long‑horizon planning abilities of large language models (LLMs). It focuses on agentic tasks where mod...

model February 11, 2026

Intern‑S1‑Pro: Trillion‑Scale Multimodal Scientific Reasoner Takes the Lead

internlm/Intern-S1-Pro

Intern‑S1‑Pro, released by the InternLM team, is a trillion‑parameter mixture‑of‑experts (MoE) foundation model that targets scientific multimodal reasoning. Tagged with **image‑text‑to‑text**, it acc...

dataset February 10, 2026

Moltbook Annotated Posts & Submolts: A Rich Resource for Content Classification

TrustAIRLab/Moltbook

The Moltbook Dataset, released by TrustAIRLab, provides over 44,000 GPT‑5.2‑annotated posts and 12,209 submolts harvested from the agent social network Moltbook. Each post is labeled with one of nine ...

model February 09, 2026

NVIDIA Personaplex 7B: Audio‑to‑Audio Model Gains Traction

nvidia/personaplex-7b-v1

The model nvidia/personaplex-7b-v1 is an audio‑to‑audio (speech‑to‑speech) model hosted on Hugging Face. It is built with the Moshi library and distributed in the safetensors format, indicating a focu...

dataset February 08, 2026

TeichAI's Small Text Dataset for Claude‑4.5 Opus Reasoning Gains Traction

TeichAI/claude-4.5-opus-high-reasoning-250x

The dataset **TeichAI/claude-4.5-opus-high-reasoning-250x** is a compact collection of text entries (size category n<1K) stored in JSON format. Created by the user *TeichAI* on November 27, 2025, it h...

model February 07, 2026

Qwen3-Coder-Next: The Trending Code‑Focused Text Generation Model

Qwen/Qwen3-Coder-Next

Qwen/Qwen3-Coder-Next is a newly released transformer model that targets text‑generation tasks, as indicated by its pipeline_tag. The model’s identifier – “Coder‑Next” – suggests a focus on programmin...

dataset February 06, 2026

Tencent's CL-bench: A New Benchmark for Long-Context Text Generation

tencent/CL-bench

The CL-bench dataset, released by Tencent, is a recently trending English-language benchmark designed for text‑generation tasks that require handling long contexts. It contains between 1,000 and 10,00...