Source Document Management

Comprehensive guide to managing source documents in the Lexi metadata database and search platform.

Overview

Source documents hold metadata about processed objects and are stored within collections in the Lexi metadata database. They are the basis for document search, content analysis, and metadata management, and describe each document's content, structure, terms, and semantic data.

Source documents are managed via the Lexi server API at [http|https]://[hostname]:[port]/v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents. The API supports document upload, metadata retrieval, content analysis, and search.
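As a minimal sketch of how the endpoint pattern above expands into a concrete URL, the following Python helper assembles the documents path from its parts. The function name documents_url and its parameters are illustrative assumptions and are not part of the Lexi API or the View SDK.

```python
from typing import Optional
from urllib.parse import quote


def documents_url(scheme: str, hostname: str, port: int,
                  tenant_guid: str, collection_guid: str,
                  document_guid: Optional[str] = None) -> str:
    """Build a documents endpoint URL following the pattern in this guide.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if scheme not in ("http", "https"):
        raise ValueError("scheme must be 'http' or 'https'")
    url = (
        f"{scheme}://{hostname}:{port}/v1.0"
        f"/tenants/{quote(tenant_guid, safe='')}"
        f"/collections/{quote(collection_guid, safe='')}"
        "/documents"
    )
    # Append the document GUID only when addressing a single document.
    if document_guid is not None:
        url += f"/{quote(document_guid, safe='')}"
    return url


print(documents_url("http", "localhost", 8000, "default", "default"))
```

Passing a document GUID as the last argument yields the single-document form of the URL used by the read operations.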

Source Document Object Structure

Source documents have the following structure:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "default",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "default",
    "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
    "ObjectKey": "5.pdf",
    "ObjectVersion": "1",
    "ContentType": "application/pdf",
    "DocumentType": "Pdf",
    "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/5.pdf",
    "ContentLength": 31811,
    "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
    "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
    "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z"
}

Field Descriptions

  • GUID (GUID): Globally unique identifier for the source document object
  • TenantGUID (GUID): Globally unique identifier for the tenant that owns this document
  • BucketGUID (GUID): Globally unique identifier for the bucket where the source object is stored
  • DataRepositoryGUID (GUID): Globally unique identifier for the data repository where the object is stored
  • CollectionGUID (GUID): Globally unique identifier for the collection containing this document
  • ObjectGUID (GUID): Globally unique identifier for the source object
  • ObjectKey (string): Key/name of the source object
  • ObjectVersion (string): Version identifier of the source object
  • ContentType (string): MIME content type of the source document
  • DocumentType (enum): Type of document (e.g., "Pdf", "Text", "Json", "Html")
  • SourceUrl (string): URL from which the source object can be retrieved
  • ContentLength (long): Length of the source document content in bytes
  • MD5Hash (string): MD5 hash of the document content as a hexadecimal string
  • SHA1Hash (string): SHA1 hash of the document content as a hexadecimal string
  • SHA256Hash (string): SHA256 hash of the document content as a hexadecimal string
  • CreatedUtc (datetime): Timestamp indicating when the source document was created, in UTC time
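
The ContentLength and hash fields can be derived from the raw bytes of the source object. The sketch below uses Python's standard hashlib module. The helper name content_fingerprint is hypothetical, and the uppercase hex output is an assumption made only to mirror the sample objects in this guide; whether the server requires a particular letter case is not stated here.

```python
import hashlib


def content_fingerprint(data: bytes) -> dict:
    """Compute the length and hash fields for a document's raw bytes.

    Assumption: uppercase hex is used only to match the sample objects
    shown in this guide.
    """
    return {
        "ContentLength": len(data),
        "MD5Hash": hashlib.md5(data).hexdigest().upper(),
        "SHA1Hash": hashlib.sha1(data).hexdigest().upper(),
        "SHA256Hash": hashlib.sha256(data).hexdigest().upper(),
    }


fields = content_fingerprint(b"abc")
for name, value in fields.items():
    print(name, value)
```

An MD5 digest is 32 hexadecimal characters, SHA1 is 40, and SHA256 is 64, which gives a quick sanity check on any hash values placed in a request body.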

Create Source Document

Uploads a source document to a collection using PUT /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents. Requires a fully-populated source document with comprehensive metadata and UDR (Unified Document Representation) data for content analysis and search functionality.

Request Parameters

Required Parameters

  • TenantGUID (GUID, Body, Required): GUID of the tenant
  • CollectionGUID (GUID, Body, Required): GUID of the target collection
  • ObjectKey (string, Body, Required): Key/name of the source object
  • ObjectVersion (string, Body, Required): Version identifier of the source object
  • ObjectGUID (GUID, Body, Required): GUID of the source object
  • ContentType (string, Body, Required): MIME content type of the document
  • DocumentType (enum, Body, Required): Type of document (e.g., "JSON", "Pdf", "Text")
  • SourceUrl (string, Body, Required): URL from which the source object can be retrieved
  • UdrDocument (object, Body, Required): Unified Document Representation containing processed content, terms, and metadata

Optional Parameters

  • BucketGUID (GUID, Body, Optional): GUID of the bucket where the source object is stored
  • DataRepositoryGUID (GUID, Body, Optional): GUID of the data repository
  • ContentLength (long, Body, Optional): Length of the document content in bytes
  • MD5Hash (string, Body, Optional): MD5 hash of the document content
  • SHA1Hash (string, Body, Optional): SHA1 hash of the document content
  • SHA256Hash (string, Body, Optional): SHA256 hash of the document content
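
Before sending a request, a client can check that every required body field listed above is present. The following is a minimal client-side sketch; the names REQUIRED_FIELDS and missing_required_fields are illustrative, and the server's own validation behavior is not described in this guide.

```python
# Required body fields as listed in this guide.
REQUIRED_FIELDS = (
    "TenantGUID", "CollectionGUID", "ObjectKey", "ObjectVersion",
    "ObjectGUID", "ContentType", "DocumentType", "SourceUrl", "UdrDocument",
)


def missing_required_fields(body: dict) -> list:
    """Return the required field names that are absent or None in body."""
    return [name for name in REQUIRED_FIELDS if body.get(name) is None]


# A deliberately incomplete body, to show what the check reports.
body = {
    "TenantGUID": "default",
    "CollectionGUID": "default",
    "ObjectKey": "blake.json",
    "ObjectVersion": "1",
    "ContentType": "application/json",
}
print(missing_required_fields(body))
```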

Response

Returns the created source document object with all metadata and processing information:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "default",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "default",
    "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
    "ObjectKey": "blake.json",
    "ObjectVersion": "1",
    "ContentType": "application/json",
    "DocumentType": "JSON",
    "SourceUrl": "http://localhost:9000/tenants/default/buckets/data/objects/sample.json",
    "ContentLength": 1024,
    "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
    "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
    "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z"
}

Read Source Document

Retrieves a specific source document by GUID using GET /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents/[document-guid]. Returns the complete source document with metadata, content information, and processing details.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • document-guid (string, Path, Required): GUID of the source document to retrieve

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents/fd937de1-480a-4db8-9025-c7ac0bd8d66c' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const retrieveSourceDocument = async () => {
  try {
    const response = await api.sourceDocumentSdk.read(
      "<collection-guid>",
      "<sourcedocument-guid>"
    );
    console.log(response, "SourceDocument fetched successfully");
  } catch (err) {
    console.log("Error fetching SourceDocument:", err);
  }
};

retrieveSourceDocument();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def readSourceDocument():
    document = lexi.SourceDocument.retrieve("<collection-guid>", "<sourcedocument-guid>")
    print(document)

readSourceDocument()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
SourceDocument sourceDocument = await sdk.SourceDocument.Retrieve(Guid.Parse("<collection-guid>"),Guid.Parse("<sourcedocument-guid>"));

Response Structure

Returns the source document object with all metadata and content information if found, or a 404 Not Found error if the document doesn't exist.

Note: To check only whether the document exists, use the HEAD method instead. A HEAD request returns 200 OK if the document exists or 404 Not Found if it does not, and never returns a response body.
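
The existence check described in the note can be sketched with Python's standard urllib. The helper names build_head_request and document_exists are hypothetical. The Authorization header value is passed through exactly as supplied, since the scheme is redacted in the examples here.

```python
import urllib.error
import urllib.request


def build_head_request(url: str, authorization: str) -> urllib.request.Request:
    """Create a HEAD request carrying the Authorization value as given."""
    return urllib.request.Request(
        url, method="HEAD", headers={"Authorization": authorization}
    )


def document_exists(url: str, authorization: str) -> bool:
    """Return True on 200 OK, False on 404 Not Found; re-raise other errors."""
    try:
        with urllib.request.urlopen(build_head_request(url, authorization)) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Re-raising any status other than 404 keeps authentication or server failures from being mistaken for a missing document.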

Response

Returns the complete source document with metadata:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "default",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "default",
    "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
    "ObjectKey": "5.pdf",
    "ObjectVersion": "1",
    "ContentType": "application/pdf",
    "DocumentType": "Pdf",
    "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/5.pdf",
    "ContentLength": 31811,
    "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
    "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
    "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z"
}

Read Source Document with Data

Retrieves a source document including its processed content data using GET /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents/[document-guid]?incldata=null. Returns the complete source document with UDR content and processed data.
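
As a rough illustration of the query-string form used by this endpoint, the sketch below appends a flag=null parameter to an existing document URL with the standard urllib.parse module. The helper name with_query_flag is an assumption for illustration; the same shape applies to the stats=null flag described under Read Statistics.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_query_flag(url: str, flag: str) -> str:
    """Return url with `flag=null` added to its query string."""
    parts = urlsplit(url)
    extra = urlencode({flag: "null"})
    # Preserve any query parameters already present on the URL.
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


base = (
    "http://localhost:8000/v1.0/tenants/default/collections/default"
    "/documents/1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f"
)
print(with_query_flag(base, "incldata"))
```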

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • document-guid (string, Path, Required): GUID of the source document to retrieve
  • incldata (string, Query, Required): Must be set to "null" to include processed content data

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents/fd937de1-480a-4db8-9025-c7ac0bd8d66c?incldata=null' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const retrieveSourceDocumentWithData = async () => {
  try {
    const response = await api.sourceDocumentSdk.read(
      "<collection-guid>",
      "<sourcedocument-guid>",
      true
    );
    console.log(response, "SourceDocument fetched successfully");
  } catch (err) {
    console.log("Error fetching SourceDocument:", err);
  }
};

retrieveSourceDocumentWithData();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def readSourceDocument():
    document = lexi.SourceDocument.retrieve("<collection-guid>", "<sourcedocument-guid>",True)
    print(document)

readSourceDocument()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
SourceDocument response = await sdk.SourceDocument.Retrieve(Guid.Parse("<collection-guid>"),Guid.Parse("<sourcedocument-guid>"), includeData: true);

Response Structure

Returns the source document object with complete metadata and processed content data, or a 404 Not Found error if the document doesn't exist.

Response

Returns the source document with complete metadata and processed content data:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "default",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "default",
    "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
    "ObjectKey": "5.pdf",
    "ObjectVersion": "1",
    "ContentType": "application/pdf",
    "DocumentType": "Pdf",
    "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/5.pdf",
    "ContentLength": 31811,
    "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
    "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
    "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z",
    "UdrDocument": {
        "Success": true,
        "AdditionalData": "Parsed successfully",
        "Key": "5.pdf:1",
        "Type": "Pdf",
        "Metadata": {
            "Author": "John Doe",
            "Title": "Sample Document"
        },
        "Terms": ["document", "content", "analysis", "text"],
        "TopTerms": {
            "document": 15,
            "content": 12,
            "analysis": 8,
            "text": 6
        },
        "Postings": [
            {
                "Term": "document",
                "Count": 15,
                "AbsolutePositions": [0, 10, 20],
                "RelativePositions": [0, 1, 2]
            }
        ],
        "Schema": {
            "Type": "Pdf",
            "MaxDepth": 2,
            "NumObjects": 5,
            "NumArrays": 0,
            "NumKeyValues": 10
        }
    }
}

Read Top Terms

Retrieves the most frequently occurring terms within a specific source document using GET /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents/[document-guid]/topterms?max-keys=10. Useful for content analysis and term frequency analysis of individual documents.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • document-guid (string, Path, Required): GUID of the source document to analyze
  • max-keys (integer, Query, Optional): Maximum number of top terms to return (default: 10)

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents/fd937de1-480a-4db8-9025-c7ac0bd8d66c/topterms?max-keys=10' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const retrieveSourceDocumentTopTerms = async () => {
  try {
    const response = await api.sourceDocumentSdk.readTopTerms(
      "<collection-guid>",
      "<sourcedocument-guid>"
    );
    console.log(response, "SourceDocument top terms fetched successfully");
  } catch (err) {
    console.log("Error fetching SourceDocument top terms:", err);
  }
};

retrieveSourceDocumentTopTerms();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def readTopTerms():
    terms = lexi.SourceDocument.retrieve_top_terms("<collection-guid>", "<sourcedocument-guid>")
    print(terms)

readTopTerms()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
CollectionTopTerms response = await sdk.SourceDocument.RetrieveTopTerms(Guid.Parse("<collection-guid>"),Guid.Parse("<sourcedocument-guid>"), maxKeys: 5);

Response Structure

Returns a JSON object with terms as keys and their frequency counts as values, or a 404 Not Found error if the document doesn't exist.

Response

Returns the most frequently occurring terms within the specific document:

{
  "document": 15,
  "content": 12,
  "analysis": 8,
  "text": 6,
  "processing": 4
}
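
Because the response is a plain JSON object mapping terms to counts, a client may want to rank it explicitly rather than rely on key order. The following sketch sorts the sample response above by frequency; rank_terms is a hypothetical helper name, not an SDK function.

```python
def rank_terms(top_terms: dict, limit: int = 10) -> list:
    """Return (term, count) pairs sorted by count descending, then by term."""
    ranked = sorted(top_terms.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Sample response body from this section.
response = {"document": 15, "content": 12, "analysis": 8, "text": 6, "processing": 4}
for term, count in rank_terms(response, limit=3):
    print(f"{term}: {count}")
```

Sorting on the term as a secondary key keeps the ordering stable when two terms share the same count.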

Read Statistics

Retrieves comprehensive statistics for a specific source document using GET /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents/[document-guid]?stats=null. Provides detailed metrics about document content, processing results, and analytical data.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • document-guid (string, Path, Required): GUID of the source document to analyze
  • stats (string, Query, Required): Must be set to "null" to retrieve statistics

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents/fd937de1-480a-4db8-9025-c7ac0bd8d66c?stats=null' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const retrieveSourceDocumentStatistics = async () => {
  try {
    const response = await api.sourceDocumentSdk.readStatistics(
      "<collection-guid>",
      "<sourcedocument-guid>"
    );
    console.log(response, "SourceDocument stats fetched successfully");
  } catch (err) {
    console.log("Error fetching SourceDocument stats:", err);
  }
};

retrieveSourceDocumentStatistics();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def readSourceDocumentStatistics():
    statistics = lexi.SourceDocument.retrieve_statistics("<collection-guid>", "<sourcedocument-guid>")
    print(statistics)

readSourceDocumentStatistics()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
SourceDocumentStatistics sourceDocumentStatistics = await sdk.SourceDocument.RetrieveStatistics(Guid.Parse("<collection-guid>"),Guid.Parse("<sourcedocument-guid>"));

Response Structure

Returns a JSON object containing comprehensive document statistics including content metrics, processing information, and analytical data, or a 404 Not Found error if the document doesn't exist.

Response

Returns comprehensive document statistics:

{
    "Document": {
        "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
        "TenantGUID": "default",
        "BucketGUID": "example-data-bucket",
        "CollectionGUID": "default",
        "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
        "ObjectKey": "5.pdf",
        "ObjectVersion": "1",
        "ContentType": "application/pdf",
        "DocumentType": "Pdf",
        "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/5.pdf",
        "ContentLength": 31811,
        "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
        "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
        "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
        "CreatedUtc": "2024-10-25T14:14:22.000000Z"
    },
    "TermCount": 45,
    "UniqueTerms": 32,
    "AverageTermLength": 6.2,
    "ProcessingTimeMs": 1250,
    "SchemaComplexity": 0.75
}

Read All Source Documents

Retrieves all source documents within a collection using GET /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents. Returns a JSON array containing all source documents with their metadata and content information.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection to retrieve documents from

Response Structure

Returns an array of all source documents in the collection, or a 404 Not Found error if no documents exist.

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const retrieveSourceDocuments = async () => {
  try {
    const response = await api.sourceDocumentSdk.readAll(
      "<collection-guid>"
    );
    console.log(response, "SourceDocuments fetched successfully");
  } catch (err) {
    console.log("Error fetching SourceDocuments:", err);
  }
};

retrieveSourceDocuments();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def readAllSourceDocuments():
    documents = lexi.SourceDocument.retrieve_all("<collection-guid>")
    print(documents)

readAllSourceDocuments()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
List<SourceDocument> sourceDocuments = await sdk.SourceDocument.RetrieveMany(Guid.Parse("<collection-guid>"));

Response

Returns an array of all source documents in the collection:

[
    {
        "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
        "TenantGUID": "default",
        "BucketGUID": "example-data-bucket",
        "CollectionGUID": "default",
        "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
        "ObjectKey": "5.pdf",
        "ObjectVersion": "1",
        "ContentType": "application/pdf",
        "DocumentType": "Pdf",
        "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/5.pdf",
        "ContentLength": 31811,
        "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
        "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
        "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
        "CreatedUtc": "2024-10-25T14:14:22.000000Z"
    },
    {
        "GUID": "another-document-guid",
        "TenantGUID": "default",
        "BucketGUID": "example-data-bucket",
        "CollectionGUID": "default",
        "ObjectGUID": "another-object-guid",
        "ObjectKey": "document.docx",
        "ObjectVersion": "1",
        "ContentType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "DocumentType": "Word",
        "SourceUrl": "http://dcc249eaaf06:8001/v1.0/tenants/default/buckets/example-data-bucket/objects/document.docx",
        "ContentLength": 25600,
        "MD5Hash": "A1B2C3D4E5F6789012345678901234567890ABCD",
        "SHA1Hash": "1234567890ABCDEF1234567890ABCDEF12345678",
        "SHA256Hash": "ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890",
        "CreatedUtc": "2024-10-25T15:30:45.123456Z"
    }
]
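
Once the array has been retrieved, it can be summarized on the client. The sketch below groups documents by DocumentType and totals their ContentLength, using a trimmed copy of the sample response above. The helper name summarize_by_type is illustrative only.

```python
from collections import defaultdict


def summarize_by_type(documents: list) -> dict:
    """Map each DocumentType to its document count and total ContentLength."""
    summary = defaultdict(lambda: {"count": 0, "bytes": 0})
    for doc in documents:
        entry = summary[doc["DocumentType"]]
        entry["count"] += 1
        # ContentLength is optional on upload, so treat a missing value as 0.
        entry["bytes"] += doc.get("ContentLength", 0)
    return dict(summary)


# Trimmed copy of the sample response: only the fields used here.
documents = [
    {"ObjectKey": "5.pdf", "DocumentType": "Pdf", "ContentLength": 31811},
    {"ObjectKey": "document.docx", "DocumentType": "Word", "ContentLength": 25600},
]
print(summarize_by_type(documents))
```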

Upload Source Document

Uploads a source document to a collection using PUT /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents. Creates a new source document entry with comprehensive metadata and UDR processing information for search and analysis capabilities.

Request Parameters

Required Parameters

  • TenantGUID (GUID, Body, Required): GUID of the tenant
  • CollectionGUID (GUID, Body, Required): GUID of the target collection
  • ObjectKey (string, Body, Required): Key/name of the source object
  • ObjectVersion (string, Body, Required): Version identifier of the source object
  • ObjectGUID (GUID, Body, Required): GUID of the source object
  • ContentType (string, Body, Required): MIME content type of the document
  • DocumentType (enum, Body, Required): Type of document (e.g., "JSON", "Pdf", "Text")
  • SourceUrl (string, Body, Required): URL from which the source object can be retrieved
  • UdrDocument (object, Body, Required): Unified Document Representation containing processed content, terms, and metadata

Response

Returns the uploaded source document object with all metadata and processing information:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "00000000-0000-0000-0000-000000000000",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "00000000-0000-0000-0000-000000000000",
    "ObjectGUID": "00000000-0000-0000-0000-000000000000",
    "ObjectKey": "blake.json",
    "ObjectVersion": "1",
    "ContentType": "application/json",
    "DocumentType": "JSON",
    "SourceUrl": "http://localhost:9000/tenants/default/buckets/data/objects/sample.json",
    "ContentLength": 1024,
    "MD5Hash": "DC477A85FF3882BBFDEB03D7B79ECC9E",
    "SHA1Hash": "CC5D85073F193A578F97D46B8A6E4CE946270B5F",
    "SHA256Hash": "E5285C6023A46E4E8917C67CCB56B91FED2E578A7AA3129680012C029868B321",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z"
}

curl --location --request PUT 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
  "TenantGUID": "00000000-0000-0000-0000-000000000000",
  "CollectionGUID": "00000000-0000-0000-0000-000000000000",
  "ObjectKey": "blake.json",
  "ObjectVersion": "1",
  "ObjectGUID": "00000000-0000-0000-0000-000000000000",
  "ContentType": "application/json",
  "DocumentType": "JSON",
  "SourceUrl": "http://localhost:9000/tenants/default/buckets/data/objects/sample.json",
  "UdrDocument": {
    "Success": true,
    "AdditionalData": "My additional data",
    "Metadata": {
      "Foo": "Bar"
    },
    "Key": "sample.json",
    "TypeResult": {
        "MimeType": "application/json",
        "Extension": "json",
        "Type": "Json"
    },
    "Terms": [
        "foo",
        "bar",
        "baz"
    ],
    "TopTerms": {
      "foo": 1,
      "bar": 1,
      "baz": 1
    },
    "Postings": [
      {
        "Term": "baz",
        "Count": 2,
        "AbsolutePositions": [
          0
        ],
        "RelativePositions": [
          0
        ]
      },
      {
        "Term": "foo",
        "Count": 2,
        "AbsolutePositions": [
          1
        ],
        "RelativePositions": [
          1
        ]
      },
      {
        "Term": "bar",
        "Count": 2,
        "AbsolutePositions": [
          2
        ],
        "RelativePositions": [
          2
        ]
      }
    ],
    "Schema": {
      "Type": "Json",
      "MaxDepth": 1,
      "NumObjects": 1,
      "NumArrays": 0,
      "NumKeyValues": 1,
      "Schema": {
        "root": "Object",
        "root.Message": "String"
      },
      "Metadata": {
        "Foo": "Bar"
      },
      "Flattened": [
        {
          "Key": "root",
          "Type": "Object"
        },
        {
          "Key": "root.Message",
          "Type": "String",
          "Data": "Your foo is bar baz!"
        }
      ]
    }
  }
}
'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const uploadSourceDocument = async () => {
  try {
    const response = await api.sourceDocumentSdk.upload({
      TenantGUID: "<tenant-guid>",
      CollectionGUID: "<collection-guid>",
      ObjectKey: "blake.json",
      ObjectVersion: "1",
      ObjectGUID: "<object-guid>",
      ContentType: "application/json",
      DocumentType: "JSON",
      SourceUrl:
        "http://localhost:9000/tenants/default/buckets/data/objects/sample.json",
      UdrDocument: {
        Success: true,
        AdditionalData: "My additional data",
        Metadata: {
          Foo: "Bar",
        },
        Key: "sample.json",
        TypeResult: {
          MimeType: "application/json",
          Extension: "json",
          Type: "Json",
        },
        Terms: ["foo", "bar", "baz"],
        TopTerms: {
          foo: 1,
          bar: 1,
          baz: 1,
        },
        Postings: [
          {
            Term: "baz",
            Count: 2,
            AbsolutePositions: [0],
            RelativePositions: [0],
          },
          {
            Term: "foo",
            Count: 2,
            AbsolutePositions: [1],
            RelativePositions: [1],
          },
          {
            Term: "bar",
            Count: 2,
            AbsolutePositions: [2],
            RelativePositions: [2],
          },
        ],
        Schema: {
          Type: "Json",
          MaxDepth: 1,
          NumObjects: 1,
          NumArrays: 0,
          NumKeyValues: 1,
          Schema: {
            root: "Object",
            "root.Message": "String",
          },
          Metadata: {
            Foo: "Bar",
          },
          Flattened: [
            {
              Key: "root",
              Type: "Object",
            },
            {
              Key: "root.Message",
              Type: "String",
              Data: "Your foo is bar baz!",
            },
          ],
        },
      },
    });
    console.log(response, "SourceDocument uploaded successfully");
  } catch (err) {
    console.log("Error uploading SourceDocument:", err);
  }
};

uploadSourceDocument();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
    service_ports={Service.LEXI: 8000},
)

def uploadSourceDocument():
    document = lexi.SourceDocument.upload("<collection-guid>", "<object-guid>", "foo.txt")
    print(document)

uploadSourceDocument()

using View.Sdk;
using View.Sdk.Lexi;
using View.Sdk.Semantic;
using System.Collections.Generic;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
            
        SourceDocument sourceDocument = new SourceDocument
        {
            GUID = Guid.Parse("<sourcedocument-guid>"),
            TenantGUID = Guid.Parse("<tenant-guid>"),
            BucketGUID = Guid.Parse("<bucket-guid>"),
            CollectionGUID = Guid.Parse("<collection-guid>"),
            ObjectGUID = Guid.Parse("<object-guid>"),
            GraphRepositoryGUID = null,
            GraphNodeIdentifier = null,
            DataRepositoryGUID = null,
            DataFlowRequestGUID = null,
            DataFlowSuccess = null,
            ObjectKey = "10.pdf",
            ObjectVersion = "1",
            ContentType = "application/pdf",
            DocumentType = DocumentTypeEnum.Unknown,
            SourceUrl = "http://localhost:9000/tenants/default/buckets/sample/objects/10.pdf",
            ContentLength = 29096038,
            MD5Hash = "******************************0FCA",
            SHA1Hash = "*******************************************50F",
            SHA256Hash = "*******************************************************D3D",
            CreatedUtc = DateTime.UtcNow,
            ExpirationUtc = null,
            Score = new DocumentScore
            {
                Score = 0.85m,
                TermsScore = 0.9m,
                FiltersScore = 0.75m
            },
            UdrDocument = new UdrDocument
             {
                GUID = Guid.NewGuid(),
                Success = true,
                AdditionalData = "Parsed successfully.",
                Key = "sample.pdf:1",
                Type = DocumentTypeEnum.Pdf,
                Metadata = new Dictionary<string, object>
                {
                    { "Author", "John Doe" },
                    { "Category", "Medical" }
                },
                Terms = new List<string>
                {
                    "botox", "treatment", "dose", "patient",
                    "botox", "injection", "dose", "patient",
                    "botox"
                },
                Postings = new List<Posting>
                {
                    new Posting
                    {
                        Term = "botox",
                        AbsolutePositions = new List<long> { 5, 15, 25 },
                        RelativePositions = new List<long> { 0, 1, 2 }
                    },
                    new Posting
                    {
                        Term = "dose",
                        AbsolutePositions = new List<long> { 35, 55 },
                        RelativePositions = new List<long> { 3, 4 }
                    }
                },
                Schema = new SchemaResult
                {
                    Type = DocumentTypeEnum.Pdf,
                    Schema = new Dictionary<string, DataTypeEnum>
                    {
                        { "root", DataTypeEnum.Object },
                        { "root.Message", DataTypeEnum.String }
                    },

                    Metadata = new Dictionary<string, object>
                    {
                        { "KeyCount", 2 }
                    },
                },
                SemanticCells = new List<SemanticCell>
                {
                    new SemanticCell
                    {
                        GUID = Guid.NewGuid(),
                        CellType = SemanticCellTypeEnum.Text,
                        Position = 0,
                        Chunks = new List<SemanticChunk>
                        {
                            new SemanticChunk
                            {
                                GUID = Guid.NewGuid(),
                                Position = 0,
                                Start = 0,
                                End = 127,
                                Length = 128,
                                Content = "Botox treatment is effective." 
                            }
                        }
                    }
                }
            }
        };

SourceDocument result = await sdk.SourceDocument.Upload(sourceDocument);

Response

Returns the uploaded source document, including its complete metadata and the results of UDR processing:

{
    "GUID": "1fdbe0c8-8b85-4b0e-ac42-dd4757684a9f",
    "TenantGUID": "default",
    "BucketGUID": "example-data-bucket",
    "CollectionGUID": "default",
    "ObjectGUID": "f615ac92-d1d1-4b46-8cc5-acf721131067",
    "ObjectKey": "10.pdf",
    "ObjectVersion": "1",
    "ContentType": "application/pdf",
    "DocumentType": "Pdf",
    "SourceUrl": "http://localhost:9000/tenants/default/buckets/sample/objects/10.pdf",
    "ContentLength": 29096038,
    "MD5Hash": "******************************0FCA",
    "SHA1Hash": "*******************************************50F",
    "SHA256Hash": "*******************************************************D3D",
    "CreatedUtc": "2024-10-25T14:14:22.000000Z",
    "Score": {
        "Score": 0.85,
        "TermsScore": 0.9,
        "FiltersScore": 0.75
    },
    "UdrDocument": {
        "GUID": "udr-document-guid",
        "Success": true,
        "AdditionalData": "Parsed successfully.",
        "Key": "sample.pdf:1",
        "Type": "Pdf",
        "Metadata": {
            "Author": "John Doe",
            "Category": "Medical"
        },
        "Terms": ["botox", "treatment", "dose", "patient"],
        "TopTerms": {
            "botox": 15,
            "treatment": 12,
            "dose": 8,
            "patient": 6
        },
        "Postings": [
            {
                "Term": "botox",
                "Count": 15,
                "AbsolutePositions": [5, 15, 25],
                "RelativePositions": [0, 1, 2]
            }
        ],
        "Schema": {
            "Type": "Pdf",
            "MaxDepth": 2,
            "NumObjects": 5,
            "NumArrays": 0,
            "NumKeyValues": 10
        }
    }
}
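
One way to sanity-check an upload is to compare the hash fields in the returned record with digests computed locally from the same object bytes. The sketch below assumes that MD5Hash, SHA1Hash, and SHA256Hash are uppercase hexadecimal digests of the raw object content, as the sample values suggest. The helper names compute_digests and mismatched_hashes are illustrative and are not part of any SDK.

```python
import hashlib


def compute_digests(data: bytes) -> dict:
    """Return uppercase hex digests keyed by the response field names."""
    return {
        "MD5Hash": hashlib.md5(data).hexdigest().upper(),
        "SHA1Hash": hashlib.sha1(data).hexdigest().upper(),
        "SHA256Hash": hashlib.sha256(data).hexdigest().upper(),
    }


def mismatched_hashes(document: dict, data: bytes) -> list:
    """List the hash fields in `document` that differ from the local digests.

    `document` is the parsed JSON response; a missing field counts as a mismatch.
    """
    local = compute_digests(data)
    return [
        name
        for name, digest in local.items()
        if str(document.get(name, "")).upper() != digest
    ]
```

An empty list from mismatched_hashes(response_json, file_bytes) indicates that all three digests agree with the local copy.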

Delete Source Document

Deletes a source document by GUID using DELETE /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents/[document-guid]. Removes the document and all of its associated metadata from the collection.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • document-guid (string, Path, Required): GUID of the source document to delete

Response

  • 200 OK: Source document deleted successfully
  • 404 Not Found: Source document does not exist

Note: Deleting a source document removes it permanently from the collection and all associated search indexes.

curl --location --request DELETE 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents/00000000-0000-0000-0000-000000000000' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const deleteSourceDocument = async () => {
  try {
    const response = await api.sourceDocumentSdk.delete(
      "<collection-guid>",
      "<sourcedocument-guid>"
    );
    console.log(response, "SourceDocument deleted successfully");
  } catch (err) {
    console.log("Error deleting SourceDocument:", err);
  }
};

deleteSourceDocument();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="tenant-guid",
    service_ports={Service.LEXI:8000},
)

def deleteSourceDocument():
    response = lexi.SourceDocument.delete("document-guid","collection-guid")
    print(response)


deleteSourceDocument()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
        
Guid CollectionGuid = Guid.Parse("<collection-guid>");
        
Guid SourceDocumentGuid = Guid.Parse("<sourcedocument-guid>");
        
bool deleted = await sdk.SourceDocument.Delete(CollectionGuid, SourceDocumentGuid);

Response

Returns 200 OK on successful deletion. No response body is returned.
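
The status codes listed above can also be handled without an SDK by calling the documented endpoint directly. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the authorization argument is assumed to be the complete Authorization header value, whose format is not shown in this guide.

```python
import urllib.error
import urllib.request


def document_url(base_url, tenant_guid, collection_guid, document_guid):
    """Build the documented per-document endpoint URL."""
    return (
        f"{base_url.rstrip('/')}/v1.0/tenants/{tenant_guid}"
        f"/collections/{collection_guid}/documents/{document_guid}"
    )


def delete_document(url, authorization):
    """Send DELETE to `url`.

    Returns True on any 2xx response and False on 404 (document not found);
    any other HTTP error is re-raised to the caller.
    """
    request = urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": authorization},
    )
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Returning False for 404 lets a caller treat "already deleted" as a non-fatal outcome, while authentication or server errors still surface as exceptions.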

Delete Source Document by Key and Version

Deletes a source document by object key and version using DELETE /v1.0/tenants/[tenant-guid]/collections/[collection-guid]/documents?key=blake.json&versionId=1. Provides an alternative method for deletion when the document GUID is not available.

Request Parameters

  • collection-guid (string, Path, Required): GUID of the collection containing the document
  • key (string, Query, Required): Object key/name of the source document to delete
  • versionId (string, Query, Required): Version identifier of the source document to delete

Response

  • 200 OK: Source document deleted successfully
  • 404 Not Found: Source document with specified key and version does not exist

Note: This method provides an alternative way to delete documents when you have the object key and version but not the document GUID.

curl --location --request DELETE 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/collections/00000000-0000-0000-0000-000000000000/documents?key=blake.json&versionId=1' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••'

import { ViewLexiSdk } from "view-sdk";

const api = new ViewLexiSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const deleteSourceDocumentFromKey = async () => {
  try {
    const response = await api.sourceDocumentSdk.deleteFromKey(
      "<collection-guid>",
      "https://www.traegerforum.com/forums/traeger-recipes.27/", //key
      "1" //version
    );
    console.log(response, "SourceDocument deleted successfully");
  } catch (err) {
    console.log("Error deleting SourceDocument:", err);
  }
};

deleteSourceDocumentFromKey();

import view_sdk
from view_sdk import lexi
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="tenant-guid",
    service_ports={Service.LEXI:8000},
)

def uploadSourceDocument():
    document = lexi.SourceDocument.create("00000000-0000-0000-0000-000000000000",
        TenantGUID="00000000-0000-0000-0000-000000000000",
        CollectionGUID="00000000-0000-0000-0000-000000000000",
        ObjectKey="blake.json",
        ObjectVersion="1",
        ObjectGUID="00000000-0000-0000-0000-000000000000",
        ContentType="application/json",
        DocumentType="JSON",
        SourceUrl="http://localhost:9000/tenants/default/buckets/data/objects/sample.json",
        UdrDocument={}
    )
    print(document)

uploadSourceDocument()

using View.Sdk;
using View.Sdk.Lexi;

ViewLexiSdk sdk = new ViewLexiSdk(Guid.Parse("<tenant-guid>"),"default", "http://localhost:8000/");
        
bool deleted = await sdk.SourceDocument.Delete(Guid.Parse("<collection-guid>"), "10.pdf", "1");

Response

Returns 200 OK on successful deletion. No response body is returned.
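
Object keys are not always simple file names. The JavaScript sample above uses a full URL as the key, and such values need percent-encoding before they are placed in the query string. The sketch below builds the documented key-and-version endpoint with the Python standard library and sends the DELETE request. The function names are illustrative, and the authorization argument is assumed to be the complete Authorization header value, whose format is not shown in this guide.

```python
import urllib.error
import urllib.request
from urllib.parse import quote, urlencode


def delete_by_key_url(base_url, tenant_guid, collection_guid, key, version_id):
    """Build the key/version endpoint URL with percent-encoded query values."""
    # quote_via=quote encodes spaces as %20 and, with urlencode's default
    # safe="", also encodes "/" and ":" inside the key.
    query = urlencode({"key": key, "versionId": version_id}, quote_via=quote)
    return (
        f"{base_url.rstrip('/')}/v1.0/tenants/{tenant_guid}"
        f"/collections/{collection_guid}/documents?{query}"
    )


def delete_by_key(url, authorization):
    """Send DELETE; True on any 2xx, False on 404, other errors re-raised."""
    request = urllib.request.Request(
        url, method="DELETE", headers={"Authorization": authorization}
    )
    try:
        with urllib.request.urlopen(request):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

For a plain key such as blake.json the resulting query string is unchanged (key=blake.json&versionId=1); encoding only alters keys that contain reserved characters.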

Best Practices

When managing source documents in the Lexi metadata database, consider the following recommendations for optimal document processing, search performance, and data management:

  • Document Organization: Organize documents within logical collections based on content type, source, or business domain for better search and management
  • Metadata Quality: Ensure comprehensive and accurate metadata is provided during document upload for optimal search and analysis capabilities
  • Content Processing: Use proper UDR document structure with complete term extraction and semantic analysis for enhanced search functionality
  • Version Management: Implement proper versioning strategies for document updates and maintain document history for audit and rollback capabilities
  • Performance Optimization: Monitor document statistics and processing metrics to optimize content analysis and search performance

Next Steps

After successfully managing source documents, you can:

  • Search Operations: Implement advanced search functionality using document content, metadata, and term analysis capabilities
  • Content Analysis: Analyze document statistics and top terms to gain insights into content patterns and trends
  • Collection Management: Organize and manage document collections for better content discovery and organization
  • Integration: Integrate source document management with other View platform services for comprehensive data processing workflows
  • Document Processing: Set up automated document processing pipelines for continuous content ingestion and analysis