
Small but Mighty: How dots.ocr is Revolutionizing Document AI


Modern document understanding has progressed rapidly thanks to deep learning, yet most high‑accuracy optical character recognition (OCR) pipelines remain complex and heavy. They typically combine separate models for layout detection, text recognition, reading‑order prediction and formula parsing.


This fragmentation adds computational overhead and makes tuning tricky. In July 2025 a team at Xiaohongshu’s HI Lab introduced dots.ocr, a multilingual vision‑language model that unifies all of these steps in a single transformer with only 1.7 billion parameters.


Even though the model is relatively small, it outperforms or matches systems built on massive language‑vision models with tens of billions of parameters. This post unpacks the architecture, benchmarks and practical considerations of dots.ocr.


Why a Unified Vision‑Language Model Matters

Traditional OCR frameworks often chain a layout detector (e.g., YOLO or DocStruct) with a recognizer like Tesseract or a transformer and then apply heuristics for reading order. Such pipelines are brittle because each component introduces errors and misalignment between bounding boxes and text tokens. Dots.ocr tackles the problem differently.


It is a single vision‑language model that can switch between tasks—layout detection, text‑only extraction, bounding‑box grounding and formula or table parsing—by simply changing the input prompt. There is no need to orchestrate multiple models or align outputs; the VLM handles the entire task from end to end. Its compact size (1.7 billion parameters) means it fits on commodity GPUs, yet its architecture is flexible enough to support multilingual text across more than 100 languages.
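
To make the prompt‑driven switching concrete, here is a minimal sketch that sends the same page to the model twice with different task prompts, assuming the model is served behind an OpenAI‑compatible endpoint (for example with vLLM, as shown in the installation section below). The endpoint URL, served model name and prompt strings are illustrative, not the repository's exact prompt templates.

import base64
from openai import OpenAI

# Assumes a local vLLM server exposing the OpenAI-compatible API
# (see the `vllm serve` command later in this post).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("my_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

def ask(prompt: str) -> str:
    """Send the same page image with a different task prompt."""
    response = client.chat.completions.create(
        model="model",  # matches --served-model-name in the vLLM command below
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.choices[0].message.content

layout_json = ask("Detect the layout of this page and return JSON.")    # layout-style prompt
plain_text  = ask("Extract all text from this page in reading order.")  # OCR-style prompt

The only thing that changes between the two calls is the prompt string; the image, the model and the serving stack stay the same.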


Unified modelling brings several advantages:

  1. Simpler deployment – Prompt‑based task switching eliminates the need for custom inference scripts for every document type. You can handle layout detection, text extraction and bounding‑box queries with the same model.

  2. Consistent alignment – Because the same network sees the image and generates the text, there is no mismatch between bounding boxes and recognized content.

  3. Efficiency – The 1.7 B parameter model achieves faster inference compared with pipeline systems built on larger foundations.

  4. Multilingual robustness – Dots.ocr’s training includes documents in 100+ languages, allowing it to parse low‑resource scripts like Tibetan or Kannada nearly as well as English and Chinese.

Architecture Overview

According to the official repository, dots.ocr uses a single transformer‑based vision‑language model. The architecture integrates a vision encoder that processes the document image and a language model that generates structured outputs. Key characteristics include:


  • Unified network with prompt‑driven tasks – The model’s prompt defines the task. For layout detection you can use the prompt_layout_only_en prompt; to extract text alone, use prompt_ocr; to produce JSON with bounding boxes and categories, you can craft your own prompt. This design reduces architectural complexity.


  • Compact parameter count – At 1.7 B parameters, dots.ocr is significantly smaller than general VLMs like Gemini‑2.5 or Qwen‑72B, yet the authors show that it achieves comparable or better performance on document parsing benchmarks. Smaller size means lower memory footprints and faster inference speeds.


  • Multimodal input handling – The model accepts single images or entire PDFs. A preprocessing pipeline upsamples images to 200 DPI when necessary and normalizes input sizes, ensuring clarity on low‑resolution scans.


  • Output formats – Dots.ocr generates structured JSON with bounding boxes, categories and extracted text, plus Markdown and HTML for table contents. The JSON output preserves reading order, table boundaries and formula regions, making integration into downstream applications straightforward.
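
To make the output format concrete, the snippet below shows the rough shape of a single layout element as it might appear in the JSON; the field names and category labels are illustrative, so check a real output file for the precise schema.

# One layout element, roughly as it might appear in the output JSON.
# The full output is a list of such elements, ordered by reading order.
element = {
    "bbox": [72, 131, 523, 208],   # x1, y1, x2, y2 in image pixels
    "category": "Table",           # e.g. Text, Title, Table, Formula, ...
    "text": "<table><tr><td>Revenue</td><td>1.7</td></tr></table>",  # table content as HTML
}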


Installation and Usage

Running dots.ocr is relatively straightforward. The official quick‑start instructions recommend creating a new conda environment, cloning the repository and installing PyTorch and the model’s dependencies:

conda create -n dots_ocr python=3.12
conda activate dots_ocr

# download repository and install dependencies
git clone https://github.com/rednote-hilab/dots.ocr.git
cd dots.ocr
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .

After installation, download the model weights via python3 tools/download_model.py. For inference you can either use vLLM, which the authors recommend for speed, or the Hugging Face Transformers API. The vLLM server is started with:

CUDA_VISIBLE_DEVICES=0 vllm serve ${hf_model_path} --tensor-parallel-size 1 --gpu-memory-utilization 0.95 --chat-template-content-format string --served-model-name model --trust-remote-code
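
If you prefer the Transformers route, the rough sketch below follows the standard Hugging Face multimodal chat pattern; the weight path, prompt text and generation settings are placeholders, and the repository's own example remains the authoritative reference.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Path to the downloaded weights (from tools/download_model.py); adjust as needed.
model_path = "./weights/DotsOCR"

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("my_image.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract all text from this page in reading order."},
    ],
}]

# Build the chat prompt and pack image + text into model inputs.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate the output and strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=2048)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])

vLLM remains the faster option for bulk processing, as the authors note, but the Transformers path is convenient for quick experiments.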

To parse documents, call the provided parser script:

  • Full layout and recognition: python3 dots_ocr/parser.py my_image.jpg

  • Layout detection only: python3 dots_ocr/parser.py my_image.jpg --prompt prompt_layout_only_en

  • Text‑only OCR (skip headers/footers): python3 dots_ocr/parser.py my_image.jpg --prompt prompt_ocr

  • Grounding (limit to a region): python3 dots_ocr/parser.py my_image.jpg --prompt prompt_grounding_ocr --bbox x1 y1 x2 y2.

Each run outputs a JSON file with bounding boxes and categories, a Markdown file containing the recognized text and a visualization image showing the detected layout. The same commands work for PDFs; just pass a file path instead of an image and optionally increase the number of threads for multi‑page files.
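
Once the parser has run, a short script like the following can consume its JSON output, for example to rebuild the plain text in reading order or pull out just the tables. The file name and field names ("bbox", "category", "text") are assumptions based on the output described above; inspect an actual output file to confirm the schema.

import json

# Load the layout JSON produced by dots_ocr/parser.py (the path will vary per run).
with open("my_image.json", encoding="utf-8") as f:
    elements = json.load(f)

# Rebuild the page text in reading order, skipping pictures (not parsed by the model).
page_text = "\n".join(
    el["text"] for el in elements
    if el.get("category") != "Picture" and el.get("text")
)

# Collect table regions separately; their text field holds HTML markup.
tables = [el for el in elements if el.get("category") == "Table"]

print(page_text[:500])
print(f"Found {len(tables)} table(s)")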


Benchmark Highlights

Dots.ocr’s authors evaluate the model on several public and internal benchmarks to demonstrate its capabilities.

OmniDocBench – End‑to‑End Parsing

OmniDocBench is a widely used benchmark for assessing end‑to‑end document parsing performance across text, formulas, tables and reading order. On this benchmark, dots.ocr outperforms many models, including those with 20× more parameters. Notable results include:

  • Text recognition: On the English dataset, the average edit distance drops to 0.032, and on Chinese it is 0.066 (lower is better). These figures beat larger general‑purpose VLMs such as GPT‑4o and Mistral, as well as specialized document models like MonkeyOCR‑Pro.

  • Formula detection: Dots.ocr’s formula recognition performance is on par with much larger models such as Gemini 2.5 Pro.

  • Table understanding: With table TEDS (tree‑edit‑distance‑based similarity) scores of 88.6 % for English and 89.0 % for Chinese, dots.ocr leads all tested methods.

  • Reading order: The model yields the lowest reading‑order error among compared systems.

These results illustrate that a 1.7 B parameter model can be extremely competitive when designed specifically for document parsing.

Internal Multilingual Benchmark (dots.ocr‑bench)

The researchers built an in‑house benchmark containing 1,493 PDFs spanning 100 languages. Dots.ocr slashed error rates compared to Doubao and MonkeyOCR; its end‑to‑end error is 0.177, with a text edit distance of 0.075 and formula error of 0.297. The model maintained high table reconstruction scores (TEDS 79.2 %) and low reading‑order errors. These results showcase the model’s multilingual robustness.

Layout Detection vs. DocLayout‑YOLO

DocLayout‑YOLO is a specialized layout detector used by many pipeline systems. Despite not being a dedicated detection model, dots.ocr surpasses DocLayout‑YOLO. On the DocStructBench dataset, dots.ocr obtains an F1 score of 0.93 at IoU ≥ 0.50, compared with 0.80 for YOLO; for formula regions the F1 is 0.832 vs 0.620. This confirms that the unified VLM can match or exceed conventional detectors.
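
For readers less familiar with detection metrics, IoU (intersection over union) measures how much a predicted box overlaps a ground‑truth box, and a detection counts as correct here when that overlap is at least 0.50. A minimal reference implementation for axis‑aligned boxes:

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes do not intersect).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0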

OLMOCR‑bench

The OLMOCR benchmark contains challenging documents—noisy scans, math‑heavy journals, documents with multiple columns and complex footers. Dots.ocr achieves an overall score of 79.1, beating MonkeyOCR‑Pro‑3B (75.8) and surpassing a variety of general VLMs like GPT‑4o and Gemini Flash 2. It particularly excels at parsing multi‑column pages and old scans.

Limitations and Known Issues

While dots.ocr is impressive, it is not perfect. The authors highlight several limitations and offer guidance:

  • High‑resolution images: Inputs with more than 11,289,600 pixels may cause failure or misalignment; downsample high‑resolution pages (a short sketch follows this list) or set the parsing DPI to 200.

  • Special characters: Consecutive ellipses (...) and underscores (__) can trigger repetition bugs. Using alternative prompts like prompt_layout_only_en or prompt_ocr can mitigate the issue.

  • Pictures are not parsed: The model currently does not extract content from embedded images or figures, focusing instead on text, tables and formulas.

  • Throughput: Because the model outputs large JSON structures, bulk PDF ingestion can be slow. The authors suggest using vLLM for improved throughput and note that future versions will optimize large‑scale processing.
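
For the first limitation above, a quick way to keep pages under the pixel cap is to downscale them before parsing. A minimal sketch using Pillow, with the roughly 11.3‑megapixel threshold quoted by the authors (verify the exact limit against the current release):

from PIL import Image

MAX_PIXELS = 11_289_600  # approximate input cap mentioned by the authors

def shrink_if_needed(path: str, out_path: str) -> None:
    """Downscale an image so its total pixel count stays under the cap."""
    img = Image.open(path)
    w, h = img.size
    if w * h > MAX_PIXELS:
        scale = (MAX_PIXELS / (w * h)) ** 0.5
        img = img.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
    img.save(out_path)

shrink_if_needed("huge_scan.png", "huge_scan_small.png")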

Understanding these constraints will help practitioners decide when dots.ocr is appropriate and how to work around its shortcomings.


How Dots.ocr Compares to Other Models

A central question is how a 1.7 B model can outperform much larger systems. The answer lies in task specificity and efficient architecture. General VLMs like GPT‑4o and Qwen‑72B are trained for open‑ended conversation and image understanding. They excel at captioning and reasoning but have not been specialized for document parsing.

Traditional OCR pipelines use detection and recognition models optimized separately, which can accumulate errors.

By contrast, dots.ocr focuses on the narrow domain of document parsing. Its vision encoder is tuned to handle scanned pages, and the language decoder has been trained to output structured layouts.

This specialization allows it to achieve lower error rates across text, tables, formulas and reading order compared with general VLMs and pipeline‑based methods. In the OmniDocBench results, dots.ocr’s overall edit distances of 0.125 (English) and 0.160 (Chinese) are the lowest among all expert VLMs listed.


Practical Applications

Dots.ocr is well‑suited for scenarios where high accuracy and multilingual support are critical but computational resources are limited. Potential applications include:

  • Digitizing multilingual archives – The model handles 100+ languages and preserves reading order, making it ideal for digitizing libraries, government records and academic journals that include non‑Latin scripts.

  • Financial and legal document processing – Dots.ocr’s table and formula extraction capabilities enable accurate parsing of invoices, contracts and reports. It outperforms specialized systems on table TEDS scores.

  • Education and research – Scholars working with scanned textbooks, exam papers or mathematical publications can leverage the model’s strong performance on noisy scans and complex formulas.

  • Enterprise automation – Integration with vLLM and Transformers allows developers to embed dots.ocr into workflow systems, delivering structured JSON outputs for downstream natural language processing or database ingestion.

Because it is released under the MIT license, organizations can build commercial applications around dots.ocr without restrictive licensing.


Future Directions

The developers plan to improve performance on complex tables and formulas, extend the model to parse pictures, and optimize throughput for large‑volume PDF processing.

They also envision a general perception model that combines detection, captioning and OCR tasks in a single VLM. Community contributions are welcome; interested researchers can contact the team via the repository’s listed email.


Final Thoughts

Dots.ocr demonstrates that bigger isn’t always better in document AI. By designing a unified vision‑language model tailored for document parsing, the developers achieve state‑of‑the‑art performance with just 1.7 B parameters.

The system supports 100+ languages, accurately extracts text, tables and formulas, and preserves reading order. Its prompt‑based interface simplifies deployment and opens the door to flexible document analysis pipelines. While there are still challenges—particularly around high‑resolution images, special characters and embedded pictures—dots.ocr represents a significant step toward lightweight, multilingual document intelligence. For researchers and practitioners looking to streamline their OCR workflows, this open‑source model is well worth exploring.


