IL3D

Abstract

In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.

Data Analysis

Room Type

The IL3D dataset contains 27,816 layouts covering 18 room types, of which synthetic data accounts for approximately 20-30%. Including synthetic layouts broadens the coverage of indoor scene categories and reduces potential biases in model training, which in turn improves the robustness of LLM-driven scene generation and other visual tasks.

Object Numbers per Room

The number of objects per room mainly ranges from 4 to 9. Rooms containing 5 objects are the most common (3,940 rooms); this object count primarily corresponds to the most typical indoor room types, such as living rooms and bedrooms.

Room Area Distribution

Most room types exhibit a multi-peak area distribution, reflecting the standardization of architectural design and the diversity of functional requirements. For example, the distributions for bathrooms and kitchens show a pronounced peak around 5-15 square meters, consistent with compact designs that prioritize practicality; in contrast, larger spaces such as living rooms and garages have wider distributions, with peaks above 20 square meters, accommodating diverse furniture layouts and vehicle storage needs.

Navigable Area Distribution

Navigable area distributions are shifted leftward relative to the total-area distributions, with reduced peak density and a narrowed range, reflecting the floor space occupied by fixed facilities, furniture, and built-in elements. For instance, the navigable area of kitchens and dining rooms is approximately 20-30% smaller than their total area, which may be attributed to cabinets and appliances; meanwhile, open spaces such as atriums and entertainment rooms maintain a high proportion of navigable area.
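The relationship between total and navigable area can be illustrated with a small sketch. The source does not state how IL3D computes navigable area, so everything here is an assumption made for illustration: a rectangular room, axis-aligned object footprints, a 0.05 m grid, and the hypothetical function name `navigable_area`.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in meters; an assumed representation
Rect = Tuple[float, float, float, float]


def navigable_area(room: Rect, footprints: List[Rect], cell: float = 0.05) -> float:
    """Approximate free floor area by counting grid cells not covered by any footprint."""
    x0, y0, x1, y1 = room
    nx = round((x1 - x0) / cell)  # number of cells along x
    ny = round((y1 - y0) / cell)  # number of cells along y
    free = 0
    for i in range(nx):
        cx = x0 + (i + 0.5) * cell  # x coordinate of the cell center
        for j in range(ny):
            cy = y0 + (j + 0.5) * cell  # y coordinate of the cell center
            blocked = any(
                fx0 <= cx <= fx1 and fy0 <= cy <= fy1
                for fx0, fy0, fx1, fy1 in footprints
            )
            if not blocked:
                free += 1
    return free * cell * cell


# Hypothetical 4 m x 3 m kitchen with a counter run and a corner appliance
room = (0.0, 0.0, 4.0, 3.0)
footprints = [(0.0, 0.0, 2.0, 0.6), (3.0, 2.0, 4.0, 3.0)]
free_area = navigable_area(room, footprints)
total_area = (room[2] - room[0]) * (room[3] - room[1])
print(f"navigable fraction: {free_area / total_area:.2f}")
```

A grid is used rather than exact polygon subtraction so that overlapping footprints are not double-counted; a real pipeline would likely also account for clearance around obstacles.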

Asset Category

Object categories are sorted in descending order of instance count, which highlights the dominance of certain objects in the dataset. For example, the "Others" category, comprising daily miscellaneous items such as seats and carpets, ranks first with more than 4,000 instances, reflecting the universality and functional necessity of these objects in everyday indoor spaces. It is followed by furniture categories such as sofas and tables, with counts gradually decreasing from common to rare types until they approach zero. This long-tailed distribution indicates that the dataset captures the frequency pattern of objects in the real world: a small number of high-frequency objects (e.g., seating and storage items) dominate, while a large number of low-frequency objects (e.g., specific decorative items) contribute to diversity. This characteristic reflects the hierarchical and personalized nature of indoor design.

Bounding Box Diagonal Length Distribution

Small categories, such as fresh fruits, French fries and food measuring tools, exhibit narrow distributions, indicating their consistent and compact dimensions, which are typical of desktop or handheld items. In contrast, large furniture categories (e.g., sofas, beds, and cabinets) display wide distribution ranges, reflecting size variations from compact to spacious designs. This characteristic is critical for 3D scene understanding algorithms, as it enables them to handle spatial layouts of multiple scales.

Bounding Box Volume Distribution

Categories with high morphological variability exhibit broad volume distributions, which reflects the impact of modular or customizable designs on occlusion and interaction modeling. In contrast, categories such as beverages, computers, and musical instruments have narrow volume distributions, emphasizing their standardized dimensions in the real world. This standardization facilitates accurate semantic segmentation and functional prediction in indoor 3D perception systems.
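The two bounding-box statistics discussed here, diagonal length and volume, follow directly from the box corners. The sketch below assumes axis-aligned boxes given as minimum and maximum corners in meters; the function names are hypothetical and are not part of any IL3D tooling.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def bbox_extent(bbox_min: Vec3, bbox_max: Vec3) -> Vec3:
    """Edge lengths of an axis-aligned box along x, y, and z."""
    dx, dy, dz = (hi - lo for lo, hi in zip(bbox_min, bbox_max))
    return (dx, dy, dz)


def bbox_diagonal(bbox_min: Vec3, bbox_max: Vec3) -> float:
    """Length of the space diagonal: sqrt(dx^2 + dy^2 + dz^2)."""
    return math.sqrt(sum(e * e for e in bbox_extent(bbox_min, bbox_max)))


def bbox_volume(bbox_min: Vec3, bbox_max: Vec3) -> float:
    """Volume of the box: dx * dy * dz."""
    dx, dy, dz = bbox_extent(bbox_min, bbox_max)
    return dx * dy * dz


# Hypothetical object spanning 2 m x 1 m x 2 m
diag = bbox_diagonal((0.0, 0.0, 0.0), (2.0, 1.0, 2.0))
vol = bbox_volume((0.0, 0.0, 0.0), (2.0, 1.0, 2.0))
```

The diagonal grows linearly with object scale while the volume grows cubically, which is one reason the two distributions can look different for the same category.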

LLM-Driven 3D Scene Generation

The scene generation process was divided into two stages: 3D asset retrieval and scene layout generation. To study the influence of natural language annotations on the spatial reasoning ability of LLMs, we adopted a "retrieval-then-generation" strategy—first retrieving object information in the scene based on text descriptions, then using the SFT-tuned LLM for reasoning and generation.
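The two stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the source does not say how retrieval is performed, so simple word-overlap (Jaccard) matching stands in for it, and the `Asset` fields, function names, prompt wording, and the three-asset library are all hypothetical. The fine-tuned LLM call itself is omitted; the sketch stops at the prompt it would receive.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Asset:
    asset_id: str                     # hypothetical identifier
    description: str                  # instance-level natural language annotation
    size: Tuple[float, float, float]  # bounding-box extent, assumed to be in meters


def _tokens(text: str) -> Set[str]:
    """Lowercased word set with basic punctuation removed."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def retrieve_asset(query: str, library: List[Asset]) -> Asset:
    """Stage 1: return the asset whose annotation best overlaps the query (Jaccard)."""
    q = _tokens(query)

    def score(asset: Asset) -> float:
        d = _tokens(asset.description)
        return len(q & d) / len(q | d)

    return max(library, key=score)


def build_layout_prompt(scene_text: str, assets: List[Asset]) -> str:
    """Stage 2 input: scene description plus retrieved asset metadata for the LLM."""
    lines = [f"Scene description: {scene_text}", "Retrieved assets:"]
    for a in assets:
        w, d, h = a.size
        lines.append(f"- {a.asset_id}: {a.description} (size {w} x {d} x {h} m)")
    lines.append("Return a translation for each asset.")  # assumed instruction wording
    return "\n".join(lines)


# Example with a made-up asset library
library = [
    Asset("bed_01", "queen-sized oak bed with a floral headboard", (1.6, 2.0, 1.1)),
    Asset("nightstand_01", "white nightstand with a drawer", (0.5, 0.4, 0.6)),
    Asset("sofa_01", "sleek gray sofa with clean lines", (2.2, 0.9, 0.8)),
]
retrieved = [retrieve_asset(q, library) for q in ["oak bed", "white nightstand"]]
prompt = build_layout_prompt("A bedroom with an oak bed and a nightstand.", retrieved)
print(prompt)
```

Retrieving first means the LLM reasons over concrete object sizes and descriptions rather than free-form text, which is the dependency on natural language annotations that this section sets out to study.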

Main Results

Main experimental results: Comparison of performance in objective and subjective metrics across I-Design, Holodeck and Qwen3-14B (Supervised Fine-Tuning on IL3D).

Ablation Study

Left: Comparison of performance in objective indicators for Supervised Fine-Tuning with Qwen3-1.7B on datasets of different scales. Right: Comparison of performance in objective indicators for Supervised Fine-Tuning of different Qwen3 models on the IL3D dataset.

Generation Results

Bedroom

Prompt: "A queen-sized oak bed with a floral headboard sits against the far wall, flanked by two white nightstands with brass lamps. A tall wooden dresser with a mirror stands in the corner, while a small reading chair with an ottoman occupies the space near the window."

Livingroom

Prompt: "A sleek sofa with clean lines sits against one wall, facing a minimalist TV mounted on the opposite wall. A black coffee table with a glass top sits on a textured rug, while a gray armchair, a side table with a modern lamp, and a potted orchid add sophisticated touches."

Kitchen

Prompt: "The minimalist kitchen has white cabinetry with touch-latch doors, a hidden refrigerator, and induction cooktop. A sleek island with an integrated sink provides workspace, while a toaster, electric kettle, and two minimalist bar stools maintain the clean look."

Bathroom

Prompt: "The tiny bathroom in a studio apartment contains a corner shower with a clear curtain, a compact toilet, and a wall-mounted sink with a small shelf. A mirrored medicine cabinet provides storage, while a towel hook, toilet paper holder, and soap dispenser save space."

Balcony

Prompt: "A plant-filled balcony has a wooden railing draped with flowering vines, surrounding a small metal bench. Some potted plants of various sizes line the floor, while a hanging macramé plant holder suspends a fern from the ceiling, and a small watering can sits in the corner."

Diningroom

Prompt: "A modern dining area showcases a rectangular table with a light brown top and sleek metal legs, accompanied by several chairs. A chandelier hangs above the table. Along one wall, a dark-toned credenza with a smooth surface provides storage, while a small potted plant rests in the corner."

Text Readability

The text representation of Universal Scene Description (USD) organizes objects and their transformations (e.g., translation and scaling attributes) in indoor scenes via an ASCII-based structure, ensuring high readability. This feature enables direct editing of object positions and dimensions, facilitating debugging and collaborative design workflows. The corresponding 3D visualization aligns precisely with the text data, demonstrating how the structured format of USD supports the accuracy of spatial mapping and rapid prototyping in the development of 3D scenes, particularly in complex indoor environments. Beyond boosting the efficiency of manual editing and debugging, this readability also forms a foundation for LLM-driven 3D scene generation. By parsing USD’s hierarchical text, LLMs can interpret object relationships and transformation parameters to generate or optimize scene layouts. For example, guided by natural language instructions, LLMs can dynamically adjust object positions or introduce new elements, leveraging USD’s structure to enable automated scene design.
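To make the readability point concrete, the sketch below reads and rewrites object translations in a small ASCII USD fragment using only Python's standard library. The fragment, prim names, and helper functions are hypothetical and not taken from IL3D. The line-based parsing is a deliberate simplification that assumes one attribute per line and ignores closing braces; a production workflow would more likely use the OpenUSD Python bindings.

```python
import re

# Hypothetical USDA fragment; prim names and values are made up for illustration.
USDA_TEXT = '''#usda 1.0

def Xform "Room"
{
    def Xform "Bed"
    {
        double3 xformOp:translate = (1.0, 0.0, 2.0)
        float3 xformOp:scale = (1.0, 1.0, 1.0)
        uniform token[] xformOpOrder = ["xformOp:translate", "xformOp:scale"]
    }

    def Xform "Nightstand"
    {
        double3 xformOp:translate = (2.2, 0.0, 2.0)
        uniform token[] xformOpOrder = ["xformOp:translate"]
    }
}
'''

_PRIM = re.compile(r'\s*def \w+ "(\w+)"')
_TRANSLATE = re.compile(r'\s*double3 xformOp:translate = \(([^)]*)\)')


def read_translations(usda: str) -> dict:
    """Map each prim name to its translate value, tracking the most recent 'def'."""
    current = None
    result = {}
    for line in usda.splitlines():
        prim = _PRIM.match(line)
        if prim:
            current = prim.group(1)
            continue
        tr = _TRANSLATE.match(line)
        if tr and current is not None:
            result[current] = tuple(float(v) for v in tr.group(1).split(","))
    return result


def set_translation(usda: str, prim_name: str, xyz) -> str:
    """Return a copy of the text with the named prim's translate line rewritten."""
    current = None
    out = []
    for line in usda.splitlines():
        prim = _PRIM.match(line)
        if prim:
            current = prim.group(1)
        elif current == prim_name and _TRANSLATE.match(line):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}double3 xformOp:translate = ({xyz[0]}, {xyz[1]}, {xyz[2]})"
        out.append(line)
    return "\n".join(out) + "\n"


# Move the nightstand to the other side of the bed
edited = set_translation(USDA_TEXT, "Nightstand", (0.2, 0.0, 2.0))
print(read_translations(edited))
```

Because the edit is a plain text substitution, the same operation could be expressed by an LLM as a modified line of USD text, which is the property this section highlights.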

Multimodal Data Export

IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks.
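Several of these modalities are geometrically related. As a hedged illustration of one such relationship, the sketch below back-projects a depth map into a point cloud and then takes its axis-aligned 3D bounding box. It assumes a pinhole camera with depth measured along the optical axis; the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are hypothetical, and this is not IL3D's export code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def depth_to_points(depth: List[List[float]], fx: float, fy: float,
                    cx: float, cy: float) -> List[Point]:
    """Back-project each valid pixel (u, v, z) to camera-space (x, y, z)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as missing
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def bounding_box(points: List[Point]) -> Tuple[Point, Point]:
    """Axis-aligned bounding box as (min corner, max corner)."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


# Tiny 2 x 2 depth map with one missing pixel
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
box_min, box_max = bounding_box(cloud)
```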

Example exports: Color Image, Depth Map, Normal Map, Semantic Mask.

BibTeX

@article{zhou2025il3d,
  title={IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation},
  author={Zhou, Wenxu and Nie, Kaixuan and Du, Hang and Yin, Dong and Huang, Wei and Guo, Siqiang and Zhang, Xiaobo and Hu, Pengbo},
  journal={arXiv preprint arXiv:2510.12095},
  year={2025}
}
