Document Preparation for Ingestion

Introduction

The quality of your RAG (Retrieval-Augmented Generation) applications in Wikit Semantics depends heavily on how well your source documents are prepared. This guide presents best practices for optimizing documents before they are ingested into the platform.

Supported Formats

Wikit Semantics supports the following formats:

  • PDF
  • Word (.docx)
  • Plain text (.txt)
  • HTML (.html)
  • Markdown (.md)
  • JSON (.json) in Wikit Semantics format
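
Before uploading a batch of files, it can help to separate those whose extension matches the list above from those that need conversion first. The following is a minimal sketch, not part of Wikit Semantics: the function name `split_by_support` is hypothetical, the `.pdf` extension is assumed for PDF files, and the check looks at the extension only (it does not verify that a `.json` file follows the Wikit Semantics format).

```python
from pathlib import Path

# Extensions taken from the list above; ".pdf" is an assumption for PDF files.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".html", ".md", ".json"}


def split_by_support(paths):
    """Return (supported, unsupported) file names, judged by extension only."""
    supported, unsupported = [], []
    for path in map(Path, paths):
        # Compare in lowercase so "REPORT.PDF" is treated like "report.pdf".
        if path.suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Files in the second list (for example legacy `.doc` files) would need to be converted to one of the supported formats before ingestion.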

General Best Practices

Structure and Organization

  • Prioritize clarity: Well-structured documents with headings, subheadings, and paragraphs are interpreted more reliably.
  • Use a logical hierarchy: Organize information coherently with a logical progression.
  • Avoid overly dense documents: Prefer multiple thematic documents rather than a single very long one.
  • Maintain consistent granularity at each heading level: When sections at the same level cover a comparable amount of content, the fragments produced during ingestion are better suited to semantic search and response generation.
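
One way to check the heading hierarchy of a Markdown source before ingestion is to look for headings that skip a level (for example a level-4 heading directly under a level-2 heading). The sketch below is illustrative only: the function name `find_heading_jumps` is hypothetical, it assumes `#`-style headings, and it says nothing about how Wikit Semantics itself splits documents.

```python
import re

# Matches "#"-style Markdown headings such as "## Setup".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def find_heading_jumps(markdown_text):
    """Return (line_number, title, previous_level, level) for every heading
    that is more than one level deeper than the heading before it."""
    jumps = []
    previous_level = 0
    in_fence = False
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        # Ignore "#" lines inside fenced code blocks (e.g. shell comments).
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level > previous_level + 1:
            jumps.append((line_number, match.group(2), previous_level, level))
        previous_level = level
    return jumps
```

An empty result means each heading is at most one level deeper than the one before it. Note that a document whose first heading is not level 1 is also reported.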

Content and Formatting

  • Use machine-readable text: Ensure that the text is selectable and not embedded in images.
  • Avoid multi-column text when possible: Text extracted from multi-column layouts may not follow the intended reading order.
  • Prefer structured formats: HTML and Markdown better preserve structure than scanned PDFs.

Specific Recommendations by Document Type

PDF Documents

  • Ensure the PDF contains searchable text, not images of text.
  • Verify that the table of contents is functional and bookmarks are correctly defined.
  • Optimize file size.

Word Documents

  • Use built-in heading styles for better structure.
  • Add descriptions to images for context.
  • Complete document properties (title, author, keywords).
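
Document properties can be checked without opening Word, because a .docx file is a ZIP package whose core properties are stored as XML. The following sketch reports which of the three properties mentioned above are absent or empty. It assumes the conventional `docProps/core.xml` location inside the package, and the function name `missing_docx_properties` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of a .docx package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Property name as used in this guide -> element that stores it.
FIELDS = {"title": "dc:title", "author": "dc:creator", "keywords": "cp:keywords"}


def missing_docx_properties(docx_file):
    """Return the sorted names of core properties that are absent or empty.

    docx_file may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: core properties live at the conventional location.
        if "docProps/core.xml" not in package.namelist():
            return sorted(FIELDS)
        root = ET.fromstring(package.read("docProps/core.xml"))
    missing = []
    for name, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is None or not (element.text or "").strip():
            missing.append(name)
    return sorted(missing)
```

For example, `missing_docx_properties("handbook.docx")` (a hypothetical file name) returns an empty list when title, author, and keywords are all filled in.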

HTML and Markdown

  • Use a semantic structure with headings (<h1>, <h2>, etc.) and paragraphs.
  • Use alt attributes for images.
  • Avoid unnecessarily complex HTML code.
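
The first two recommendations can be checked automatically on an HTML file. The sketch below uses Python's standard `html.parser` module to list the heading levels in document order and the images that lack a non-empty alt attribute. The names `StructureAudit` and `audit_html` are hypothetical, and the check treats an empty alt as missing, which may be too strict for purely decorative images.

```python
from html.parser import HTMLParser


class StructureAudit(HTMLParser):
    """Collect heading levels and <img> tags without a non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.heading_levels = []      # e.g. [1, 2, 2, 3]
        self.images_missing_alt = []  # src values of the offending images

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; h1..h6 are the heading elements.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.heading_levels.append(int(tag[1]))
        elif tag == "img":
            attributes = dict(attrs)
            if not (attributes.get("alt") or "").strip():
                self.images_missing_alt.append(attributes.get("src") or "")


def audit_html(html_text):
    """Return (heading_levels, images_missing_alt) for an HTML string."""
    parser = StructureAudit()
    parser.feed(html_text)
    parser.close()
    return parser.heading_levels, parser.images_missing_alt
```

A heading list such as `[1, 2, 4]` points to a skipped level, and any entry in the second list identifies an image that would reach ingestion without a textual description.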

Conclusion

Careful preparation of your source documents improves the performance of your RAG applications in Wikit Semantics. For the best results, prioritize well-structured, logically organized documents with machine-readable text.