Wikit Semantics JSON Format Documentation

Introduction

The Wikit Semantics JSON format is a data structure for representing digital documents semantically. Inspired by Schema.org, it makes document content easier to integrate, process, and analyze in your applications.

This format offers a simple way to represent digital documents for ingestion into LLM applications that use retrieval-augmented generation (RAG).

Basic Structure (Document)

A document in Wikit Semantics format is represented by a JSON object with the following properties:

json
{
    "@context": "https://wikit.ai",
    "@type": "Document",
    "identifier": "document-1",
    "title": "Document title",
    "url": "https://source-du-document.com/page",
    "hasPart": [
        // Array of DocumentChunk fragments
    ]
}
  • @context: Always set to "https://wikit.ai"
  • @type: Always "Document" for the root object
  • identifier: The identifier of the document in the original source
  • title: The title of the document
  • url: The source URL of the document
  • hasPart: An array containing the document chunks

Document Chunks (DocumentChunk)

Each document chunk is represented by a DocumentChunk object:

json
{
    "@type": "DocumentChunk",
    "text": "Text content of the chunk"
}
  • @type: Always "DocumentChunk"
  • text: The textual content of the chunk, which may include formatting (e.g., Markdown)
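The two objects described so far can be modelled in Python as a minimal sketch. The class names mirror the `@type` values, but the `to_dict` methods, the `chunks` attribute, and the `WIKIT_CONTEXT` constant are illustrative names chosen here, not part of the format.

```python
from dataclasses import dataclass, field
from typing import List

# Value the format prescribes for "@context"
WIKIT_CONTEXT = "https://wikit.ai"


@dataclass
class DocumentChunk:
    text: str  # textual content, possibly Markdown

    def to_dict(self) -> dict:
        return {"@type": "DocumentChunk", "text": self.text}


@dataclass
class Document:
    identifier: str
    title: str
    url: str
    chunks: List[DocumentChunk] = field(default_factory=list)

    def to_dict(self) -> dict:
        # Emit the properties in the same order as the format description
        return {
            "@context": WIKIT_CONTEXT,
            "@type": "Document",
            "identifier": self.identifier,
            "title": self.title,
            "url": self.url,
            "hasPart": [chunk.to_dict() for chunk in self.chunks],
        }
```

Calling `Document(...).to_dict()` yields a plain dictionary that can be passed directly to `json.dumps`.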

Usage

Document Creation

To create a document in Wikit Semantics format:

  1. Start with the root object and set the properties @context, @type, identifier, title, and url.
  2. Divide the document content into logical chunks.
  3. For each chunk, create a DocumentChunk object and add it to the hasPart array.
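The three steps above can be sketched as follows. The format does not prescribe how to chunk, so splitting at top-level Markdown headings is only one assumed strategy, and `split_markdown_by_heading` and `build_document` are hypothetical helper names.

```python
def split_markdown_by_heading(markdown: str) -> list:
    """Split Markdown text into chunks, starting a new chunk at each '# ' heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A top-level heading closes the chunk collected so far
        if line.startswith("# ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace
    return [chunk for chunk in chunks if chunk]


def build_document(identifier: str, title: str, url: str, markdown: str) -> dict:
    """Build a Wikit Semantics document dictionary from raw Markdown."""
    return {
        "@context": "https://wikit.ai",
        "@type": "Document",
        "identifier": identifier,
        "title": title,
        "url": url,
        "hasPart": [
            {"@type": "DocumentChunk", "text": text}
            for text in split_markdown_by_heading(markdown)
        ],
    }
```

This naive splitter also treats a line starting with `# ` inside a fenced code block as a heading, so real content may need a proper Markdown parser.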

Reading and Processing

To process a Wikit Semantics document:

  1. Parse the JSON to get the document object.
  2. Access metadata via the identifier, title, and url properties.
  3. Iterate over the hasPart array to process each chunk individually.
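For RAG ingestion, it is often convenient to attach the document metadata to every chunk before indexing. The sketch below follows the three steps above; the record keys (`document_id`, `position`, and so on) are assumptions made for illustration and are not defined by the format.

```python
import json


def iter_chunk_records(raw_json: str):
    """Yield one flat record per chunk, carrying the parent document's metadata."""
    doc = json.loads(raw_json)
    for position, chunk in enumerate(doc.get("hasPart", [])):
        yield {
            "document_id": doc.get("identifier"),
            "title": doc.get("title"),
            "url": doc.get("url"),
            "position": position,  # order of the chunk within the document
            "text": chunk["text"],
        }
```

Each record can then be embedded and stored independently while still pointing back to its source document.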

Code Example (Python)

Here is a simple example of creating and reading a Wikit Semantics document in Python:

python
import json

# Document creation
document = {
    "@context": "https://wikit.ai",
    "@type": "Document",
    "identifier": "document-1",
    "title": "My Document",
    "url": "https://example.com/document",
    "hasPart": [
        {"@type": "DocumentChunk", "text": "# Introduction\n\nThis is the first paragraph."},
        {"@type": "DocumentChunk", "text": "# Chapter 1\n\nContent of chapter 1."},
        {"@type": "DocumentChunk", "text": "# Conclusion\n\nFinal summary."}
    ]
}

# JSON serialization (ensure_ascii=False keeps accented characters readable)
json_doc = json.dumps(document, indent=2, ensure_ascii=False)
print(json_doc)

# Reading and processing
parsed_doc = json.loads(json_doc)
print(f"Title: {parsed_doc['title']}")
print(f"URL: {parsed_doc['url']}")
for chunk in parsed_doc['hasPart']:
    print(f"Chunk: {chunk['text'][:50]}...")  # Displays the first 50 characters

Best Practices

  1. Maintain consistent granularity in document chunking.
  2. Use consistent formatting (like Markdown) in text chunks.
  3. Ensure that the provided URL is valid and accessible.
  4. Always validate the JSON structure before processing.
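Practices 3 and 4 can be partly automated with a small structural check. This is a sketch that assumes all six root properties are required, which the format description implies but does not state outright; `validate_document` is a hypothetical helper, and it checks only that the URL is syntactically well formed, not that it is reachable.

```python
from urllib.parse import urlparse


def validate_document(doc) -> list:
    """Return a list of problems found in a parsed document (empty if none)."""
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]

    errors = []
    if doc.get("@context") != "https://wikit.ai":
        errors.append('@context must be "https://wikit.ai"')
    if doc.get("@type") != "Document":
        errors.append('@type must be "Document"')

    for key in ("identifier", "title", "url"):
        if not isinstance(doc.get(key), str) or not doc[key]:
            errors.append(f"{key} must be a non-empty string")

    url = doc.get("url")
    if isinstance(url, str) and url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("url must be an absolute http(s) URL")

    parts = doc.get("hasPart")
    if not isinstance(parts, list):
        errors.append("hasPart must be an array")
    else:
        for index, chunk in enumerate(parts):
            if not isinstance(chunk, dict) or chunk.get("@type") != "DocumentChunk":
                errors.append(f'hasPart[{index}] must have @type "DocumentChunk"')
            elif not isinstance(chunk.get("text"), str):
                errors.append(f"hasPart[{index}].text must be a string")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a document at once.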