What's Inside a DoclingDocument

The previous page explained that your file is first converted into a DoclingDocument tree. This page walks through what's actually in that tree.

Think of DoclingDocument as Word's "Outline View" turned into something you can manipulate from code. It throws away visual representation (fonts, colors, margins) and keeps only structure (titles, paragraphs, tables, lists).


Same data, two views

A DoclingDocument keeps the same content in two parallel views at the same time.

flowchart TB
    subgraph Tree["Tree view (reading order)"]
        Body["Body (root)"]
        Body --> H1["SectionHeaderItem<br/>'Chapter 1: Intro'"]
        H1 --> P1["TextItem<br/>'Body text...'"]
        H1 --> H2["SectionHeaderItem<br/>'1.1 Background'"]
        H2 --> P2["TextItem<br/>'Background body...'"]
    end

    subgraph Flat["Flat lists (indexed by type)"]
        Texts["doc.Texts[0..3]<br/>= [H1, P1, H2, P2]"]
        Tables["doc.Tables[]"]
        Pictures["doc.Pictures[]"]
        Groups["doc.Groups[]"]
    end

  • Tree view: parent/child relationships in reading order. Markdown serialization walks this tree depth-first.
  • Flat lists: the same elements, indexed by type (doc.Texts, doc.Tables, doc.Pictures, doc.Groups). Useful for "how many tables?" or "give me the n-th picture" lookups.

These are not two copies of the data: the tree node and the flat-list entry are the same instance, referenced from two places.
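The shared-instance idea can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `SketchNode`, `SketchDocument`, `AddText`, and `Walk` are hypothetical names, and the sketch links nodes with direct object references for brevity, whereas the library links them through RefItem.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; not the library's real classes.
public sealed class SketchNode
{
    public string Text { get; }
    public List<SketchNode> Children { get; } = new List<SketchNode>();
    public SketchNode(string text) { Text = text; }
}

public sealed class SketchDocument
{
    public SketchNode Body { get; } = new SketchNode("body");
    public List<SketchNode> Texts { get; } = new List<SketchNode>();

    // Registers one instance in both views: the flat list and the parent's children.
    public SketchNode AddText(string text, SketchNode parent)
    {
        var node = new SketchNode(text);
        Texts.Add(node);
        parent.Children.Add(node);
        return node;
    }

    // Depth-first walk in reading order, starting below the given node.
    public static IEnumerable<SketchNode> Walk(SketchNode node)
    {
        foreach (var child in node.Children)
        {
            yield return child;
            foreach (var descendant in Walk(child))
                yield return descendant;
        }
    }
}
```

Building the four nodes from the diagram (H1 under the body, P1 and H2 under H1, P2 under H2) makes `Walk(doc.Body)` visit them in the same order as `doc.Texts`, and `ReferenceEquals(doc.Texts[0], doc.Body.Children[0])` holds because both views point at one object.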


Element types

Each type represents a different kind of meaningful unit.

TitleItem: document title

The main document title (Word's "Title" style; markdown #).

doc.AddTitle("2026 Business Plan");
// → "# 2026 Business Plan"

SectionHeaderItem: section / subsection heading

Word's "Heading 1" and "Heading 2", PowerPoint slide titles, and Excel sheet names all map here.

doc.AddHeading("Chapter 1: Intro", level: 1);   // → "## Chapter 1: Intro"
doc.AddHeading("1.1 Background",   level: 2);   // → "### 1.1 Background"

Level starts at 1. The number of # characters at output is Level + 1 (Title takes #, so headings start at ##), capped at 6.
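The level rule is simple arithmetic, and a small sketch makes the cap explicit. `HeadingSketch` is a hypothetical helper written for this page, not part of the library; rejecting levels below 1 is an assumption of the sketch, since the page only says that levels start at 1.

```csharp
using System;

public static class HeadingSketch
{
    // Level 1 maps to "##" because "#" is reserved for the title; the count is capped at six.
    public static string Prefix(int level)
    {
        // Assumption: the sketch rejects levels below 1; the library's behavior here is not documented.
        if (level < 1)
            throw new ArgumentOutOfRangeException(nameof(level), "Level starts at 1.");
        return new string('#', Math.Min(level + 1, 6));
    }

    public static string Render(string text, int level) => Prefix(level) + " " + text;
}
```

Under this rule, levels 5 and above all produce six # characters, so deep heading levels become indistinguishable in the Markdown output.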

TextItem: a regular paragraph

A single paragraph of body text.

doc.AddParagraph("This document outlines the company's 2026 direction.");

Label lets you classify it further (Paragraph, Text, Title, SectionHeader, ...). Plain body paragraphs use DocItemLabel.Paragraph.

DocListItem: one list entry

A single bullet point or numbered item.

doc.AddListItem("First item");                                    // → "- First item"
doc.AddListItem("First item", enumerated: true, marker: "1.");    // → "1. First item"

A node represents one entry, not the whole list. Multiple DocListItems under the same parent render as one continuous list.

TableItem + TableData + TableCell: a table

TableItem is the table itself, TableData holds rows/cols metadata, TableCell is each cell.

var data = new TableData { NumRows = 2, NumCols = 3 };
data.TableCells.Add(new TableCell {
    Text = "header1",
    StartRowOffsetIdx = 0, EndRowOffsetIdx = 1,
    StartColOffsetIdx = 0, EndColOffsetIdx = 1,
    ColumnHeader = true,
});
// ... remaining cells
doc.AddTable(data);

RowSpan/ColSpan plus the offset indices express merged cells.
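To see how offsets describe a merged cell, here is a minimal sketch that expands cells into a rectangular grid. `SketchCell` and `GridSketch` are hypothetical names. The sketch assumes the end offsets are exclusive, which is inferred from the example above (a single cell spans 0 to 1), and it repeats the merged cell's text in every position it covers, which is a choice made for illustration rather than documented serializer behavior.

```csharp
using System.Collections.Generic;

// Hypothetical cell mirroring the offset fields shown above.
public sealed class SketchCell
{
    public string Text = "";
    public int StartRow, EndRow;   // EndRow is exclusive (assumption inferred from the example)
    public int StartCol, EndCol;   // EndCol is exclusive (same assumption)
}

public static class GridSketch
{
    // Fills every grid position a cell covers with that cell's text.
    // Positions not covered by any cell stay null.
    public static string[,] Expand(int numRows, int numCols, IEnumerable<SketchCell> cells)
    {
        var grid = new string[numRows, numCols];
        foreach (var cell in cells)
            for (int r = cell.StartRow; r < cell.EndRow; r++)
                for (int c = cell.StartCol; c < cell.EndCol; c++)
                    grid[r, c] = cell.Text;
        return grid;
    }
}
```

With these assumptions, a cell whose column offsets run from 0 to 2 covers two columns, so its span is simply the end offset minus the start offset.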

If any cell has ColumnHeader = true, that row is treated as a header. If no cells are explicitly marked, GridTableSerializer falls back to treating the first row as the header.
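The header rule above can be sketched as a small function. This is an illustration under stated assumptions, not the code inside GridTableSerializer: `HeaderCell` and `HeaderSketch` are hypothetical names, and a cell is attributed to the row where it starts.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical cell carrying only the fields the header rule needs.
public sealed class HeaderCell
{
    public int StartRow;
    public bool ColumnHeader;
}

public static class HeaderSketch
{
    // Returns the row indices treated as header rows.
    public static SortedSet<int> HeaderRows(IEnumerable<HeaderCell> cells, int numRows)
    {
        // Rows that contain at least one explicitly flagged cell.
        var rows = new SortedSet<int>(cells.Where(c => c.ColumnHeader).Select(c => c.StartRow));

        // Fallback: with no explicit flags, treat the first row as the header.
        if (rows.Count == 0 && numRows > 0)
            rows.Add(0);
        return rows;
    }
}
```

Note that the fallback only fires when no cell is flagged at all; a table whose only flagged cell sits in row 1 gets row 1 as its header and row 0 is treated as data.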

Note: The HWP parser only respects the author's explicit TitleCell flag. HWP documents commonly use left-column headers (where the first column is the header, not the first row), so positional heuristics would misclassify those layouts. Office parsers default to "first row is header" because that's the dominant convention there.

PictureItem: image placeholder

doc.AddPicture();
// → "<!-- image -->" (default placeholder, configurable via MarkdownSerializer.ImagePlaceholder)

Current loaders insert placeholders without extracting the image binary.

GroupItem: container

Groups child elements together. It has no visual output of its own; only its children render.

Uses:

  • PowerPoint: each slide is a GroupItem (label = Slide)
  • Excel: each sheet is a GroupItem (label = Sheet)
  • General: any logical grouping (chapters, list groups, etc.)

var slideGroup = doc.AddGroup("Slide 1", GroupLabel.Slide);
doc.AddHeading("Slide title", 2, slideGroup);
doc.AddParagraph("Slide body", slideGroup);

This makes "chunk by slide" a one-liner: doc.Groups.Where(g => g.Label == GroupLabel.Slide).

CodeItem, FormulaItem: code blocks and math (rare)

doc.AddCode("print('hello')", language: "python");
// → ```python
//   print('hello')
//   ```

Formulas render as $$...$$ (block LaTeX). Office documents rarely produce these; they're more common from PDF or markdown sources.


Per-format mapping table

How each parser maps source elements to DoclingDocument nodes:

| Source element | Word | Excel | PowerPoint | HWP |
| --- | --- | --- | --- | --- |
| Document title | "Title" style → TitleItem | - | - | - |
| Heading | "Heading1~9" → SectionHeaderItem | sheet name → SectionHeaderItem (level 2) | title placeholder → SectionHeaderItem (level 2) | "개요 1~9" (Outline 1~9) → SectionHeaderItem |
| Paragraph | regular paragraph → TextItem | - | text-box paragraph → TextItem | regular paragraph → TextItem |
| List | numbered paragraph → DocListItem | - | bullet/autonum paragraph → DocListItem | list paragraph → DocListItem |
| Table | <w:tbl> → TableItem (incl. merged cells) | whole sheet → one TableItem | <a:tbl> → TableItem | ControlTable → TableItem |
| Image | (not supported yet) | - | (not supported yet) | (not supported yet) |
| Container | - | each sheet → GroupItem(Sheet) | each slide → GroupItem(Slide) | - |

What's a RefItem? (read only if needed)

Parent/child relationships in the tree don't use direct object references. They use RefItem, a string pointer.

item.Parent = new RefItem("#/body");                // parent is the body root
parent.Children.Add(new RefItem("#/texts/0"));      // first text item

The string follows JSON Pointer syntax (RFC 6901) in its URI-fragment form, with a leading #. This is the same convention docling uses.

Why strings instead of object references?

  • The whole DoclingDocument should serialize/deserialize as JSON cleanly (no cycles)
  • Stable identifiers are needed when crossing process / file / API boundaries

Call refItem.Resolve(doc) to get the actual object back.
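To make the pointer format concrete, here is a minimal sketch of how a string such as "#/texts/0" could be turned back into an object. `RefSketch`, `Parse`, and its `Resolve` are hypothetical helpers written for this page; the real `refItem.Resolve(doc)` may work differently. The sketch handles only the two shapes shown above, "#/body" and "#/<collection>/<index>".

```csharp
using System;
using System.Collections.Generic;

public static class RefSketch
{
    // Splits "#/texts/0" into ("texts", 0). A root reference such as "#/body" yields index -1.
    public static (string Collection, int Index) Parse(string cref)
    {
        if (!cref.StartsWith("#/", StringComparison.Ordinal))
            throw new FormatException("Expected a reference starting with \"#/\": " + cref);

        var parts = cref.Substring(2).Split('/');
        if (parts.Length == 1 && parts[0].Length > 0)
            return (parts[0], -1);
        if (parts.Length == 2 && int.TryParse(parts[1], out var index) && index >= 0)
            return (parts[0], index);
        throw new FormatException("Unsupported reference: " + cref);
    }

    // Looks the reference up in the flat lists; "#/body" returns the root object.
    public static object Resolve(string cref, object body, IDictionary<string, List<object>> collections)
    {
        var (name, index) = Parse(cref);
        if (index < 0)
        {
            if (name == "body") return body;
            throw new KeyNotFoundException("Unknown root: " + name);
        }
        return collections[name][index];
    }
}
```

Because the lookup goes through the flat lists, resolving "#/texts/0" returns the same instance that sits at index 0 of the texts list, which fits the earlier point that the tree and the flat lists share objects rather than copy them.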