Wikit Semantics JSON Format Documentation
Introduction
The Wikit Semantics JSON format is a data structure designed to semantically represent digital documents. Inspired by Schema.org, this format facilitates the integration, processing, and analysis of documentary content in your applications.
This format offers a simple way to represent digital documents for ingestion into LLM applications that use retrieval-augmented generation (RAG).
Basic Structure (Document)
A document in Wikit Semantics format is represented by a JSON object with the following properties:
json
{
  "@context": "https://wikit.ai",
  "@type": "Document",
  "identifier": "document-1",
  "title": "Document title",
  "url": "https://source-du-document.com/page",
  "hasPart": [
    // Array of DocumentChunk fragments
  ]
}
- @context: Always set to "https://wikit.ai"
- @type: Always "Document" for the root object
- identifier: The identifier of the document in the original source
- title: The title of the document
- url: The source URL of the document
- hasPart: An array containing the document chunks
Document Chunks (DocumentChunk)
Each document chunk is represented by a DocumentChunk object:
json
{
  "@type": "DocumentChunk",
  "text": "Text content of the chunk"
}
- @type: Always "DocumentChunk"
- text: The textual content of the chunk, which may include formatting (e.g., Markdown)
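The two objects described above can be modeled in application code before they are serialized. The following is a minimal sketch, assuming Python 3.9 or later; the class names, the `chunks` attribute, and the `to_dict` helper are illustrative choices and not part of the Wikit Semantics format itself.

```python
from dataclasses import dataclass, field
from typing import Any

# Fixed value taken from the format description above.
WIKIT_CONTEXT = "https://wikit.ai"


@dataclass
class DocumentChunk:
    """One entry of the hasPart array (hypothetical helper class)."""
    text: str

    def to_dict(self) -> dict[str, Any]:
        return {"@type": "DocumentChunk", "text": self.text}


@dataclass
class Document:
    """Root object of a Wikit Semantics document (hypothetical helper class)."""
    identifier: str
    title: str
    url: str
    chunks: list[DocumentChunk] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # @context and @type are constant for the root object.
        return {
            "@context": WIKIT_CONTEXT,
            "@type": "Document",
            "identifier": self.identifier,
            "title": self.title,
            "url": self.url,
            "hasPart": [chunk.to_dict() for chunk in self.chunks],
        }
```

Calling `Document(...).to_dict()` yields a plain dictionary that can be passed to `json.dumps`.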
Usage
Document Creation
To create a document in Wikit Semantics format:
- Start with the root object with the properties @context, @type, title, and url.
- Divide the document content into logical chunks.
- For each chunk, create a DocumentChunk object and add it to the hasPart array.
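The steps above can be sketched as follows. This example assumes the source content is Markdown and treats each top-level heading (`# `) as the start of a new logical chunk; that splitting rule and the function names `split_markdown_by_heading` and `build_document` are assumptions made for illustration, not something the format prescribes.

```python
def split_markdown_by_heading(markdown: str) -> list[str]:
    """Split Markdown text into chunks, starting a new chunk at each '# ' heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # A top-level heading closes the chunk collected so far.
        if line.startswith("# ") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [chunk for chunk in chunks if chunk]


def build_document(identifier: str, title: str, url: str, markdown: str) -> dict:
    """Assemble the root object and one DocumentChunk per logical chunk."""
    return {
        "@context": "https://wikit.ai",
        "@type": "Document",
        "identifier": identifier,
        "title": title,
        "url": url,
        "hasPart": [
            {"@type": "DocumentChunk", "text": text}
            for text in split_markdown_by_heading(markdown)
        ],
    }
```

This naive splitter does not recognize fenced code blocks, so a line beginning with `# ` inside one would also start a new chunk. A different chunking strategy (by paragraph, by token count) can be swapped in without changing `build_document`.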
Reading and Processing
To process a Wikit Semantics document:
- Parse the JSON to get the document object.
- Access metadata via the title and url properties.
- Iterate over the hasPart array to process each chunk individually.
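As an illustration of these steps, the sketch below turns a serialized document into one flat record per chunk, each carrying the document metadata alongside the chunk text. This shape is a common input for an indexing step, but the record field names (`source_title`, `source_url`, `position`) and the function name `iter_chunk_records` are hypothetical and not defined by the format.

```python
import json
from typing import Any, Iterator


def iter_chunk_records(raw_json: str) -> Iterator[dict[str, Any]]:
    """Yield one record per chunk, pairing its text with the document metadata."""
    doc = json.loads(raw_json)
    # A missing hasPart is treated as an empty document (an assumption).
    for position, chunk in enumerate(doc.get("hasPart", [])):
        yield {
            "text": chunk["text"],
            "source_title": doc.get("title"),
            "source_url": doc.get("url"),
            "position": position,  # order of the chunk within the document
        }
```

Keeping the title and URL on every record lets a downstream application cite the source of a retrieved chunk without reloading the whole document.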
Code Example (Python)
Here is a simple example of creating and reading a Wikit Semantics document in Python:
python
import json

# Document creation
document = {
    "@context": "https://wikit.ai",
    "@type": "Document",
    "identifier": "document-1",
    "title": "My Document",
    "url": "https://example.com/document",
    "hasPart": [
        {"@type": "DocumentChunk", "text": "# Introduction\n\nThis is the first paragraph."},
        {"@type": "DocumentChunk", "text": "# Chapter 1\n\nContent of chapter 1."},
        {"@type": "DocumentChunk", "text": "# Conclusion\n\nFinal summary."}
    ]
}

# JSON serialization
json_doc = json.dumps(document, indent=2)
print(json_doc)

# Reading and processing
parsed_doc = json.loads(json_doc)
print(f"Title: {parsed_doc['title']}")
print(f"URL: {parsed_doc['url']}")
for chunk in parsed_doc['hasPart']:
    print(f"Chunk: {chunk['text'][:50]}...")  # Displays the first 50 characters
Best Practices
- Maintain consistent granularity in document chunking.
- Use consistent formatting (like Markdown) in text chunks.
- Ensure that the provided URL is valid and accessible.
- Always validate the JSON structure before processing.
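
The validation step can be sketched as a small function that returns a list of problems instead of raising on the first one. It assumes that every property listed in the basic structure is required and must be a string (the format description does not state which properties are optional), and it only checks that the URL is syntactically well formed, not that it is reachable. The function name `validate_document` is hypothetical.

```python
from typing import Any
from urllib.parse import urlparse


def validate_document(doc: Any) -> list[str]:
    """Return a list of structural problems; an empty list means the document passed."""
    if not isinstance(doc, dict):
        return ["document must be a JSON object"]

    errors: list[str] = []
    if doc.get("@context") != "https://wikit.ai":
        errors.append("'@context' must be \"https://wikit.ai\"")
    if doc.get("@type") != "Document":
        errors.append("'@type' must be \"Document\"")

    # Assumption: identifier, title, and url are all required, non-empty strings.
    for key in ("identifier", "title", "url"):
        if not isinstance(doc.get(key), str) or not doc[key]:
            errors.append(f"'{key}' must be a non-empty string")

    # Syntactic URL check only; accessibility would need a network request.
    url = doc.get("url")
    if isinstance(url, str) and url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append("'url' must be an absolute http(s) URL")

    parts = doc.get("hasPart")
    if not isinstance(parts, list):
        errors.append("'hasPart' must be an array")
    else:
        for i, chunk in enumerate(parts):
            if not isinstance(chunk, dict) or chunk.get("@type") != "DocumentChunk":
                errors.append(f"hasPart[{i}] must be a DocumentChunk object")
            elif not isinstance(chunk.get("text"), str):
                errors.append(f"hasPart[{i}].text must be a string")
    return errors
```

Collecting all errors in one pass makes it easier to report every problem in a batch of documents before any of them is processed.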