Semantic Cell Extraction

This page provides an overview of APIs related to semantic cell extraction.

Semantic cell extraction in View is the process of analyzing source content to identify regions within a data asset that are likely to contain semantically related content. Each such region, for instance a paragraph within a Word document, is extracted as a single unit called a semantic cell.

Semantic cell extraction is accessible by calling PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell on the semantic cell extractor server, which by default listens on port 8000. Type detection must be performed before attempting to extract semantic cells.
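
As a minimal sketch of how the endpoint path above is assembled, the following Python helper builds the full URL for a given tenant. The helper name semantic_cell_url and its host parameter are hypothetical and not part of any View SDK; the path and the default port 8000 come from the description above.

```python
from urllib.parse import quote


def semantic_cell_url(host: str, tenant_guid: str, port: int = 8000) -> str:
    """Build the semantic cell extraction URL for one tenant.

    Hypothetical helper for illustration; only the path layout and the
    default port are taken from this page.
    """
    # Percent-encode the tenant GUID so it is safe as a single path segment.
    tenant = quote(tenant_guid, safe="")
    return f"http://{host}:{port}/v1.0/tenants/{tenant}/processing/semanticcell"


print(semantic_cell_url("localhost", "00000000-0000-0000-0000-000000000000"))
```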

The semantic cell extraction request must include the following properties:

  • DocumentType enum one of Pptx Docx Xlsx Text Json Xml Html Parquet Pdf DataTable Csv
  • MetadataRule object the metadata rule, with the properties defined below
    • MaxChunkContentLength int the maximum chunk content length
    • MinChunkContentLength int the minimum chunk content length
    • SemanticCellEndpoint string the semantic cell endpoint
    • ShiftSize int the number of bytes to shift when advancing the sliding window for chunk detection
  • Data string the source data, base64 encoded
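
The properties above can be assembled into a request body along these lines. This is a sketch only: the function build_request, its default values, and its local sanity checks are assumptions made for illustration, not documented server behavior. The property names and the list of document types are taken from the list above.

```python
import base64
import json

# Document types listed on this page.
DOCUMENT_TYPES = {
    "Pptx", "Docx", "Xlsx", "Text", "Json", "Xml",
    "Html", "Parquet", "Pdf", "DataTable", "Csv",
}


def build_request(document_type, raw, endpoint, min_len=1, max_len=512, shift=512):
    """Return a request body dict for semantic cell extraction.

    Hypothetical helper; the checks below are local sanity checks and
    do not claim to mirror validation performed by the server.
    """
    if document_type not in DOCUMENT_TYPES:
        raise ValueError(f"unknown DocumentType: {document_type}")
    if min_len > max_len:
        raise ValueError("MinChunkContentLength exceeds MaxChunkContentLength")
    return {
        "DocumentType": document_type,
        "MetadataRule": {
            "SemanticCellEndpoint": endpoint,
            "MinChunkContentLength": min_len,
            "MaxChunkContentLength": max_len,
            "ShiftSize": shift,
        },
        # The source data must be base64 encoded before it is sent.
        "Data": base64.b64encode(raw).decode("ascii"),
    }


body = build_request("Text", b"Hello, world!", "http://viewdemo:8000/")
print(json.dumps(body, indent=4))
```

The resulting dictionary can be serialized with json.dumps and sent as the request body with a Content-Type of application/json.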

An example request is as follows:

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/semanticcell' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "DocumentType": "Pdf",
    "MetadataRule": {
        "SemanticCellEndpoint": "http://viewdemo:8000/",
        "MinChunkContentLength": 1,
        "MaxChunkContentLength": 512,
        "ShiftSize": 512
    },
    "Data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqD...."
}'

The equivalent request using the JavaScript SDK (view-sdk):

import { ViewProcessorSdk } from "view-sdk";

const api = new ViewProcessorSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const extractSemanticCells = async () => {
  try {
    const response = await api.processSdk.extractSemanticCells({
      DocumentType: "Pdf",
      MetadataRule: {
        SemanticCellEndpoint: "http://viewdemo:8000/",
        MinChunkContentLength: 1,
        MaxChunkContentLength: 512,
        ShiftSize: 512,
      },
      Data: "JVBERi0xLjcNC...",
    });
    console.log(response);
  } catch (error) {
    console.error("Error extracting semantic cells:", error);
  }
};

extractSemanticCells();

The equivalent request using the Python SDK (view_sdk):

import view_sdk
from view_sdk import processor

# Configure the SDK with the access key, server address, and tenant
sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="<tenant-guid>",
)


def semantic_cell_extraction():
    # Submit the base64-encoded document together with its metadata rule
    result = processor.SemanticCell.extraction(
        DocumentType='pdf',
        Data="JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbikgL1N0cnVjdFRyZWVSb290IDE4IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vTWV0YWRhdGEgODAgMCBSL1ZpZXdlclByZWZlcmVuY2VzIDgxIDAgUj4+DQplbmRvYmoNCjIgMCBvYmoNCjw8L1R5cGUvUGFnZXMvQ291bnQgMS9LaWRzWyAzIDAgUl0gPj4NCmVuZG9iag0KMyAwIG9iag0KPDwvVHlwZS9QYWdlL1BhcmVudCAyIDAgUi9SZXNvdXJjZXM8PC9Gb250PDwvRjEgNSAwIFIvRjIgMTIgMCBSL0YzIDE0IDAgUj",
        MetadataRule={
            "SemanticCellEndpoint": "http://viewdemo:8000/",
            "MinChunkContentLength": 1,
            "MaxChunkContentLength": 512,
            "ShiftSize": 512,
        },
    )
    print(result)


semantic_cell_extraction()

The response includes a Success boolean indicating whether the request succeeded and a Timestamp object describing the request's start time, end time, and total runtime in milliseconds. The SemanticCells property contains an array of the semantic cells found in the supplied payload. An example response is as follows:

{
    "Success": true,
    "Timestamp": {
        "Start": "2025-04-30T12:48:19.218168Z",
        "End": "2025-04-30T12:48:20.151467Z",
        "TotalMs": 933.3,
        "Messages": {}
    },
    "SemanticCells": [
        {
            "GUID": "ab5fa437-e784-4cba-be5f-83936bc8259a",
            "CellType": "Table",
            "MD5Hash": "328A482C9D1C0F87B7EF5AA424B0A378",
            "SHA1Hash": "A8E2B4E01E86E7BEF14A3274064C75E268694EDB",
            "SHA256Hash": "03400563FEA89D3458D4304179F2E2690ACDC6E598B23F984B3A99737E9C5A26",
            "Position": 0,
            "Length": 322,
            "Chunks": [
                {
                    "GUID": "2a2fedc0-2bda-470c-8b38-a5964fd19f00",
                    "MD5Hash": "5B43676864350681E36C2D4BC888B6C8",
                    "SHA1Hash": "63A077417A60F0E25DD13526F1A800E8528244B4",
                    "SHA256Hash": "1B08584C2931BEDB8C15A665DAADC8018E2A127B00EF6525DB9970DB426D920E",
                    "Position": 0,
                    "Start": 0,
                    "End": 82,
                    "Length": 82,
                    "Content": "| Column1 | Column2 | Column3 |\n|---|---|---|\n| Row names | Column 1 | Column 2 |\n",
                    "Embeddings": []
                }
            ],
            "Children": []
        },
        {
            "GUID": "03ecc465-257d-4576-9209-4d22d69857a0",
            "CellType": "List",
            "MD5Hash": "CECA959DE15881953B8752F0EBF349E0",
            "SHA1Hash": "1CF7F4E3BF58B6D17D5A12A87643535EC526A3DD",
            "SHA256Hash": "783953D669985425418A39D45B78ACFB426B6A52745E677A2072A6EF0613F9FE",
            "Position": 1,
            "Length": 20,
            "Chunks": [
                {
                    "GUID": "ce3a2c40-5fe9-45b3-8936-d771d4efbbea",
                    "MD5Hash": "414EDCD286451DB29B0F65959E350FC7",
                    "SHA1Hash": "CA0FF26208DD22433102131A63973CCBA334BACB",
                    "SHA256Hash": "47E2C6E55A90203CA7B0EC13DCD30C1A0E4A830B19D62C09CE3B1390B7193E1F",
                    "Position": 0,
                    "Start": 0,
                    "End": 20,
                    "Length": 20,
                    "Content": "Item 1\nItem 2\nItem 3",
                    "Embeddings": []
                }
            ],
            "Children": []
        }
    ]
}
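
A response of this shape can be consumed roughly as follows. This is a sketch under two assumptions: the response has already been parsed into a Python dictionary, and entries in Children are nested semantic cells with the same structure (the example above only shows empty Children arrays). The function iter_chunks is a hypothetical helper, not part of the View SDK.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_chunks(cells: List[Dict[str, Any]]) -> Iterator[Tuple[Any, Dict[str, Any]]]:
    """Yield (cell_type, chunk) pairs for every chunk, depth first.

    Assumes Children holds nested cells shaped like their parent.
    """
    for cell in cells:
        for chunk in cell.get("Chunks") or []:
            yield cell.get("CellType"), chunk
        # Recurse into child cells, if any are present.
        yield from iter_chunks(cell.get("Children") or [])


# A trimmed copy of the example response, keeping only the fields used here.
response = {
    "Success": True,
    "SemanticCells": [
        {
            "CellType": "List",
            "Chunks": [
                {"Start": 0, "End": 20, "Length": 20,
                 "Content": "Item 1\nItem 2\nItem 3"}
            ],
            "Children": [],
        }
    ],
}

if response["Success"]:
    for cell_type, chunk in iter_chunks(response["SemanticCells"]):
        print(cell_type, chunk["Length"], repr(chunk["Content"]))
```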