Microsoft Open Sources the MarkItDown Project to Convert PDFs, Office Documents, Images, and Audio/Video to Markdown Format
Many developers prefer to write in Markdown, and Microsoft has now open-sourced a project called MarkItDown that converts a wide range of content into Markdown, drawing on AI for content such as images and audio.
It can convert from formats including the following:
- PDF
- PowerPoint / PPTX
- Excel / XLSX
- Word / DOCX
- Images / EXIF metadata and OCR
- Audio / EXIF metadata and speech transcription
- HTML / Special handling for Wikipedia and others
- Other text-based formats like CSV, JSON, XML, etc.
Formats such as images and audio cannot be converted to text directly. For images, MarkItDown reads EXIF metadata and applies OCR; for audio, it reads metadata and transcribes speech to text.
So what is the project for? It lets developers convert large numbers of files in mixed formats into Markdown, which makes later indexing and text analysis easier because everything ends up as plain, lightly structured text.
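To illustrate the indexing step, here is a minimal sketch that splits already-converted Markdown into sections by heading, so that each section can be indexed separately. The function name `split_by_headings` is hypothetical and is not part of MarkItDown; the sketch assumes ATX-style (`#`) headings and ignores heading-like lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

# ATX-style Markdown heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown_text: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) sections.

    Text before the first heading is returned under an empty heading.
    Illustrative helper only; it is not part of MarkItDown.
    """
    sections: List[Tuple[str, str]] = []
    heading = ""
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current section unless it is completely empty.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        # Track fenced code blocks so '# comment' lines in code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            heading = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return sections
```

Each `(heading, body)` pair could then be handed to whatever search index or analysis tool is in use.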
The project is open-sourced under the MIT license. Interested developers can find it here: https://github.com/microsoft/markitdown
Below is a brief guide to using it.
You can install it using pip:
    pip install markitdown
Or install from source by running the following from the root of a cloned copy of the repository:
    pip install -e .
The API usage is also very straightforward:
    from markitdown import MarkItDown

    markitdown = MarkItDown()
    result = markitdown.convert("test.xlsx")
    print(result.text_content)
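Since the stated use case is converting many files at once, the single-file call above can be wrapped in a small batch helper. The following is a sketch under stated assumptions: `convert_folder`, `default_converter`, and `DEFAULT_SUFFIXES` are hypothetical names, the suffix list is simply taken from the formats listed in this article, and the only MarkItDown calls used are `MarkItDown().convert(path)` and `.text_content` as shown above. The converter is passed in as a plain function so the helper can be tried out without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, List, Optional, Sequence

# Assumption: suffixes picked from the format list in this article; adjust as needed.
DEFAULT_SUFFIXES = (".pptx", ".xlsx", ".docx", ".html", ".csv", ".json", ".xml")


def default_converter() -> Callable[[str], str]:
    """Build a path -> Markdown function backed by MarkItDown."""
    # Imported lazily so the rest of the helper works without the package.
    from markitdown import MarkItDown

    md = MarkItDown()
    return lambda path: md.convert(path).text_content


def convert_folder(
    src_dir,
    out_dir,
    convert: Optional[Callable[[str], str]] = None,
    suffixes: Sequence[str] = DEFAULT_SUFFIXES,
) -> List[Path]:
    """Convert every matching file under src_dir and mirror it into out_dir.

    Each output keeps the original file name and appends '.md'
    (report.docx -> report.docx.md) so files that differ only by
    extension do not overwrite each other.
    """
    convert = convert or default_converter()
    src, out = Path(src_dir), Path(out_dir)
    written: List[Path] = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        target = out / path.relative_to(src).parent / (path.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(path)), encoding="utf-8")
        written.append(target)
    return written
```

With MarkItDown installed, `convert_folder("docs", "docs_md")` would walk `docs` and write one Markdown file per supported input; the folder names here are placeholders.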
Images can also be described by a large language model, in which case you need to pass in the model client and the model name:
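    from markitdown import MarkItDown
    from openai import OpenAI

    # The OpenAI client reads its API key from the environment.
    client = OpenAI()
    # Note: later MarkItDown releases rename these arguments to llm_client and llm_model.
    md = MarkItDown(mlm_client=client, mlm_model="gpt-4o")
    result = md.convert("example.jpg")
    print(result.text_content)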