A powerful tool for creating fine-tuning datasets for Large Language Models
简体中文 | English | Türkçe
Features โข Quick Start โข Documentation โข Contributing โข License
If you like this project, please give it a Star ⭐️, or buy the author a coffee => Donate ❤️!
</div>

Easy Dataset is an application designed for building large language model (LLM) datasets. It offers an intuitive interface together with powerful built-in document parsing tools, intelligent segmentation algorithms, and data cleaning and augmentation capabilities. It can convert domain-specific documents in a variety of formats into high-quality structured datasets, suitable for scenarios such as model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation.
Easy Dataset 1.7.0 launches brand-new evaluation capabilities! You can effortlessly convert domain-specific documents into evaluation datasets (test sets) and automatically run multi-dimensional evaluation tasks. It also ships with a human blind-test system, making it easy to cover needs such as vertical-domain model evaluation, post-fine-tuning performance assessment, and RAG recall-rate evaluation. Tutorial: https://www.bilibili.com/video/BV1CRrVB7Eb4/
https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8
<b>Setup.exe</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<b>Intel</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<b>M</b>
</a>
</td>
<td align="center" valign="middle">
<a href='https://github.com/ConardLi/easy-dataset/releases/latest'>
<b>AppImage</b>
</a>
</td>
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
npm install
npm run build
npm run start
```
Open http://localhost:1717 in your browser.

```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
```
Create a docker-compose.yml file:

```yaml
services:
  easy-dataset:
    image: ghcr.io/conardli/easy-dataset
    container_name: easy-dataset
    ports:
      - '1717:1717'
    volumes:
      - ./local-db:/app/local-db
      - ./prisma:/app/prisma
    restart: unless-stopped
```
Note: It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths, to stay consistent with the database paths used when starting via NPM.
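Before the first `docker-compose up -d`, you can create the two mount folders yourself so they exist with your user's permissions (a minimal sketch; Docker would otherwise create missing bind-mount directories itself, typically owned by root on Linux):

```shell
# Create the mount directories next to docker-compose.yml so the container
# shares the same database files as an NPM-started instance.
mkdir -p ./local-db ./prisma
ls -d ./local-db ./prisma
```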
Note: The database file is initialized automatically on first startup; there is no need to run `npm run db:push` manually.
```bash
docker-compose up -d
```
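After starting the container, it can take a few seconds before the app answers on port 1717. A small polling helper can confirm readiness (a sketch; the URL, retry count, and use of `curl` are assumptions, not part of the project's tooling):

```shell
# wait_for_url URL [TRIES]: poll URL once per second until it responds
# successfully, or give up after TRIES attempts.
wait_for_url() {
  url="$1"
  tries="${2:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf "$url" >/dev/null 2>&1; then
      echo "up"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "timed out waiting for $url" >&2
  return 1
}

# Example usage:
# wait_for_url http://localhost:1717 30
```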
Open http://localhost:1717 in your browser.

If you want to build the image yourself, use the Dockerfile in the project root directory:
```bash
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset
docker build -t easy-dataset .
docker run -d \
  -p 1717:1717 \
  -v ./local-db:/app/local-db \
  -v ./prisma:/app/prisma \
  --name easy-dataset \
  easy-dataset
```
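Note that relative host paths in `-v` (such as `./local-db`) are only accepted by newer Docker Engine releases; on older versions you can build the same command with absolute paths instead (a sketch; the `$(pwd)`-based paths assume you run it from the cloned repository root):

```shell
# Resolve the mount paths to absolute form before invoking docker run.
DB_DIR="$(pwd)/local-db"
PRISMA_DIR="$(pwd)/prisma"
mkdir -p "$DB_DIR" "$PRISMA_DIR"

# Print the equivalent docker run command (copy it, or replace echo with eval).
cmd="docker run -d -p 1717:1717 -v $DB_DIR:/app/local-db -v $PRISMA_DIR:/app/prisma --name easy-dataset easy-dataset"
echo "$cmd"
```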
Note: It is recommended to use the `local-db` and `prisma` folders in the current code repository directory as mount paths, to stay consistent with the database paths used when starting via NPM.
Note: The database file is initialized automatically on first startup; there is no need to run `npm run db:push` manually.
Open http://localhost:1717 in your browser.

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

Please ensure that tests are appropriately updated and adhere to the existing coding style.
https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men
This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.
If this work is helpful, please kindly cite as:
```bibtex
@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}
```