This page provides an overview of APIs related to semantic cell extraction.

View semantic cell extraction is a process by which source content is analyzed to identify regions within the data asset that have a high probability of containing semantically related content. Each such region, for instance a paragraph within a Word document, is extracted as a single unit called a semantic cell.

Semantic cell extraction is accessible by calling PUT /v1.0/document on the semantic cell extractor server, which by default listens on port 8341. Type detection must be performed before attempting to extract semantic cells.

The semantic cell extraction request must include the following properties:

  • DocumentType (enum): one of Pptx, Docx, Xlsx, Text, Json, Xml, Html, Parquet, Pdf, DataTable, Csv
  • MaxChunkContentLength (int): the maximum chunk content length
  • ShiftSize (int): the number of bytes to shift when advancing the sliding window for chunk detection
  • Data (string): the source data, base64 encoded
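
To make the relationship between MaxChunkContentLength and ShiftSize concrete, the following is a minimal sketch of a sliding window over a byte buffer. It is an illustration only: the function name sliding_windows is hypothetical, and the assumption that the server advances a fixed-size window in exactly this way is not confirmed by this page.

```python
def sliding_windows(data: bytes, max_len: int, shift: int):
    """Return (start, end) byte offsets for each window over data.

    Assumption: a window of at most max_len bytes is advanced by shift
    bytes until the end of the data is reached. This mirrors the property
    descriptions above, not the server's actual implementation.
    """
    if max_len <= 0 or shift <= 0:
        raise ValueError("max_len and shift must be positive")

    windows = []
    start = 0
    while start < len(data):
        end = min(start + max_len, len(data))
        windows.append((start, end))
        if end == len(data):
            break  # the final window already covers the tail of the data
        start += shift
    return windows
```

Under this sketch, a ShiftSize equal to MaxChunkContentLength gives back-to-back windows with no overlap, while a smaller ShiftSize gives overlapping windows.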

An example request is as follows:

{
    "DocumentType": "Text",
    "MaxChunkContentLength": 512,
    "ShiftSize": 512,
    "Data": "VGl0bGUgb2YgQm9vawoKVGhlIGF..."
}
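
A request like the one above could be built and sent with the Python standard library roughly as follows. This is a sketch under stated assumptions: the helper names build_request and extract_semantic_cells are hypothetical, the host http://localhost:8341 assumes a local server on the default port, the Content-Type header is assumed, and any authentication your deployment requires is not shown.

```python
import base64
import json
import urllib.request


def build_request(text, document_type="Text",
                  max_chunk_content_length=512, shift_size=512):
    """Build the request body, base64-encoding the source text as UTF-8."""
    return {
        "DocumentType": document_type,
        "MaxChunkContentLength": max_chunk_content_length,
        "ShiftSize": shift_size,
        "Data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }


def extract_semantic_cells(payload, base_url="http://localhost:8341"):
    """Send the payload with PUT /v1.0/document and return the parsed JSON.

    Assumptions: base_url points at the semantic cell extractor server and
    the server accepts a JSON body with a JSON content type.
    """
    request = urllib.request.Request(
        url=base_url + "/v1.0/document",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, extract_semantic_cells(build_request("Title of Book")) would submit a short text document, provided a server is reachable at the assumed address.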

The result will include a Success bool indicating whether or not the request succeeded, along with a Timestamp object recording the start time, end time, and total runtime in milliseconds. SemanticCells will contain an array of the semantic cells found within the supplied payload, each with its own Chunks array. An example response is as follows:

{
    "Success": true,
    "Timestamp": {
        "Start": "2024-10-28T02:48:54.244704Z",
        "End": "2024-10-28T02:48:54.272924Z",
        "TotalMs": 28.22,
        "Messages": {}
    },
    "SemanticCells": [
        {
            "GUID": "760da08a-a58a-49d9-aa67-a4e46dbdeeaa",
            "CellType": "Text",
            "MD5Hash": "07CB0C1EFFA3EEC68D3C0EE4C76CF8CA",
            "SHA1Hash": "A36CF35F5ECD8741F36C1D8F693A3372357800F5",
            "SHA256Hash": "9627A45F9F768F906D621461F9440DC05005ADF43915ADE7FEF55AE7CA15755C",
            "Position": 0,
            "Length": 13,
            "Chunks": [
                {
                    "GUID": "4145f030-ca80-450c-9efb-e4318e51e40d",
                    "MD5Hash": "07CB0C1EFFA3EEC68D3C0EE4C76CF8CA",
                    "SHA1Hash": "A36CF35F5ECD8741F36C1D8F693A3372357800F5",
                    "SHA256Hash": "9627A45F9F768F906D621461F9440DC05005ADF43915ADE7FEF55AE7CA15755C",
                    "Position": 0,
                    "Start": 0,
                    "End": 13,
                    "Length": 13,
                    "Content": "Title of Book",
                    "Embeddings": []
                },
                ...
            ],
            "Children": []
        },
        { ... }
    ]
}
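
A response of this shape can be flattened into its chunk contents with a short recursive walk. The sketch below is illustrative only: the function names iter_chunks and chunk_texts are hypothetical, and it assumes that entries in Children are semantic cells with the same structure as their parent, which the example above (where Children is empty) does not confirm.

```python
def iter_chunks(cells):
    """Yield (cell GUID, chunk) pairs, descending into Children.

    Assumption: each child cell has the same shape as a top-level cell.
    """
    for cell in cells:
        for chunk in cell.get("Chunks", []):
            yield cell["GUID"], chunk
        yield from iter_chunks(cell.get("Children", []))


def chunk_texts(response):
    """Return the Content of every chunk, or raise if Success is false."""
    if not response.get("Success"):
        raise RuntimeError("semantic cell extraction did not succeed")
    return [chunk["Content"]
            for _, chunk in iter_chunks(response.get("SemanticCells", []))]
```

Passing the parsed response body to chunk_texts returns the chunk contents in document order, which is a convenient input for downstream steps such as embedding generation.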