Docling
What is Docling?
- Docling is a Python library developed by IBM that converts documents into structured data
- It simplifies downstream document and AI processing by detecting tables, formulas, and reading order, and by running OCR on scanned content
Install Docling
# Create a folder for your project and move into it
mkdir pdf-to-markdown && cd pdf-to-markdown
# Create the virtual environment (this path assumes Homebrew Python 3.12 on Apple Silicon; adjust it for your system)
/opt/homebrew/bin/python3.12 -m venv venv
# Activate it
source venv/bin/activate
# Install Docling
pip install docling
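To confirm that the install landed in the active virtual environment, a short check like the one below can be run with python. This is a minimal sketch that uses only the standard library; installed_version is a hypothetical helper name, not something Docling provides.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report whether docling is visible from the interpreter running this script.
version = installed_version("docling")
print(f"docling {version}" if version else "docling is not installed in this environment")
```

If the message says docling is not installed, the virtual environment is probably not activated in the current shell.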
Use Docling
touch convert.py
nano convert.py
- Paste the following into the file
from docling.document_converter import DocumentConverter
# 1. Initialize the converter
converter = DocumentConverter()
# 2. Specify your PDF file path (or use a URL!)
source = "your_document.pdf"
print("Converting document... Please wait.")
# 3. Convert the document
result = converter.convert(source)
# 4. Extract the markdown string
markdown_output = result.document.export_to_markdown()
# 5. Save the markdown to a file
output_file = "output.md"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(markdown_output)
print(f"Success! Saved markdown to {output_file}")
- Put a PDF named your_document.pdf in the same folder, then run the script:
python3 convert.py
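The conversion steps can also be wrapped in a small reusable function. The following is a minimal sketch under stated assumptions: markdown_path_for and convert_to_markdown are hypothetical names, the helper handles local files only, and it relies on the same DocumentConverter().convert(...) and export_to_markdown() calls used in convert.py.

```python
from pathlib import Path


def markdown_path_for(source: str, output_dir: str = ".") -> Path:
    """Return the .md output path for a source document (hypothetical helper)."""
    return Path(output_dir) / (Path(source).stem + ".md")


def convert_to_markdown(source: str, output_dir: str = ".") -> Path:
    """Convert one local document to Markdown and return the output path."""
    source_path = Path(source)
    if not source_path.is_file():
        raise FileNotFoundError(f"No such document: {source_path}")

    # Imported here so a missing file is reported before Docling is loaded.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(str(source_path))
    output_path = markdown_path_for(source, output_dir)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return output_path
```

Calling convert_to_markdown("your_document.pdf") would write your_document.md to the current folder; a missing input raises FileNotFoundError instead of failing partway through the conversion.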
Run Docling in Docker
Step 1: Create a requirements.txt File
streamlit
docling
Step 2: Create a Dockerfile
# Use a pre-built data science image, chosen so the Linux graphics (GL) libraries that Docling's dependencies rely on should already be present
FROM jupyter/scipy-notebook:latest
# Switch to root to handle file permissions and setup
USER root
WORKDIR /app
# Copy requirements and install python packages
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy your Streamlit app script
COPY app.py .
# Expose the port Streamlit listens on
EXPOSE 8501
# Run Streamlit, bound to all interfaces so the published port is reachable
ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
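The Dockerfile copies an app.py that these notes do not show, so the build needs one to exist. Below is one possible sketch of such a script, not the author's original: it lets a user upload a PDF, converts it with the same Docling calls used in convert.py, and offers the Markdown for download. The helper names markdown_filename and pdf_bytes_to_markdown, the page title, and the button labels are all assumptions.

```python
import tempfile
from pathlib import Path

try:
    import streamlit as st
except ImportError:  # lets the helpers below be tested without Streamlit installed
    st = None


def markdown_filename(uploaded_name: str) -> str:
    """Map an uploaded file name such as 'report.pdf' to 'report.md'."""
    return Path(uploaded_name).stem + ".md"


def pdf_bytes_to_markdown(data: bytes) -> str:
    """Write the uploaded bytes to a temporary PDF and convert it with Docling."""
    # Imported lazily so the page can render before the heavy Docling import.
    from docling.document_converter import DocumentConverter

    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = Path(tmp_dir) / "upload.pdf"
        pdf_path.write_bytes(data)
        result = DocumentConverter().convert(str(pdf_path))
        return result.document.export_to_markdown()


def main() -> None:
    st.title("PDF to Markdown with Docling")
    uploaded = st.file_uploader("Upload a PDF", type=["pdf"])
    if uploaded is None:
        return  # nothing to do until a file is uploaded

    with st.spinner("Converting document... Please wait."):
        markdown_output = pdf_bytes_to_markdown(uploaded.getvalue())

    st.download_button(
        "Download Markdown",
        data=markdown_output,
        file_name=markdown_filename(uploaded.name),
        mime="text/markdown",
    )
    st.markdown(markdown_output)


# Streamlit executes the script as __main__, so this guard runs under `streamlit run`.
if __name__ == "__main__" and st is not None:
    main()
```

This version builds a new converter for every upload, which keeps the sketch short at the cost of speed; caching the converter between uploads would be a natural next step.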
Step 3: Build the Docker Image
docker build --no-cache -t local-docling-app .
Step 4: Run the Container with Automatic Restart
# Stop and remove any previous container with the same name
docker stop docling-web || true
docker rm docling-web || true
# Launch the updated app in the background; it restarts automatically unless you stop it
docker run -d --name docling-web -p 8501:8501 --restart unless-stopped local-docling-app
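Once the container is up, the app should answer on port 8501. As a rough sketch, the check below polls the app from Python until it responds. The /_stcore/health path is an assumption about the installed Streamlit version (older releases used a different path), and health_url, is_healthy, and wait_until_healthy are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def health_url(host: str = "localhost", port: int = 8501) -> str:
    """Build the URL of the (assumed) Streamlit health endpoint."""
    return f"http://{host}:{port}/_stcore/health"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url: str, attempts: int = 30, delay: float = 1.0, probe=is_healthy) -> bool:
    """Probe the URL up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

For example, wait_until_healthy(health_url()) returns True once the app responds and False if it never does within roughly 30 seconds; in the latter case, docker logs docling-web is a reasonable place to look next.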