Semantic Cell Extraction - View Processing Platform

Overview

Semantic cell extraction analyzes the content of data assets within the View Processing platform. It identifies regions of a data asset that contain semantically related content and extracts each region as a single unit, called a semantic cell, for use in later processing.

Semantic cell extraction is available through the View Processing API at [http|https]://[hostname]:[port]/v1.0/tenants/[tenant-guid]/processing/semanticcell and accepts the document types listed under Supported Document Types.

API Endpoints

PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell - Extract semantic cells from data assets
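
The endpoint URL is assembled from a scheme, hostname, port, and tenant GUID. As a minimal sketch of that template, the helper below fills in the placeholders; the function name semantic_cell_url is hypothetical and not part of any View SDK.

```python
from urllib.parse import quote


def semantic_cell_url(hostname: str, port: int, tenant_guid: str, scheme: str = "http") -> str:
    """Fill in the documented URL template for semantic cell extraction.

    Hypothetical helper for illustration; it is not part of the View SDK.
    """
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    # Percent-encode the tenant GUID so it is always a single path segment.
    tenant = quote(tenant_guid, safe="")
    return f"{scheme}://{hostname}:{port}/v1.0/tenants/{tenant}/processing/semanticcell"


print(semantic_cell_url("localhost", 8000, "00000000-0000-0000-0000-000000000000"))
```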

Supported Document Types

Semantic cell extraction supports the following document types:

Pptx - Microsoft PowerPoint presentations
Docx - Microsoft Word documents
Xlsx - Microsoft Excel spreadsheets
Text - Plain text documents
Json - JSON data files
Xml - XML documents
Html - HTML web pages
Parquet - Parquet data files
Pdf - PDF documents
DataTable - Tabular data structures
Csv - Comma-separated value files

Extract Semantic Cells

Extracts semantic cells from a data asset using PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell. Run type detection on the asset first, so that the DocumentType sent in the request matches the asset's actual format.

Request Parameters

Required Parameters

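DocumentType (enum, Body, Required): One of the values listed under Supported Document Types
Data (string, Body, Required): Base64-encoded content of the source data asset
MetadataRule (object, Body, Required): Metadata rule configuration for semantic cell extraction; the examples below set SemanticCellEndpoint, MinChunkContentLength, MaxChunkContentLength, and ShiftSize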
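
The three required parameters can be assembled into a request body as sketched below. The helper name build_extraction_request is hypothetical, and the case-sensitive check against the supported types list is an assumption made for illustration; the service or an SDK may accept other casings.

```python
import base64

# Values copied from the Supported Document Types list.
SUPPORTED_DOCUMENT_TYPES = {
    "Pptx", "Docx", "Xlsx", "Text", "Json", "Xml",
    "Html", "Parquet", "Pdf", "DataTable", "Csv",
}


def build_extraction_request(document_type: str, raw: bytes, metadata_rule: dict) -> dict:
    """Build the JSON body for a semantic cell extraction request.

    Hypothetical helper; the strict, case-sensitive type check is an assumption.
    """
    if document_type not in SUPPORTED_DOCUMENT_TYPES:
        raise ValueError(f"unsupported DocumentType: {document_type!r}")
    if not metadata_rule:
        raise ValueError("MetadataRule is required")
    return {
        "DocumentType": document_type,
        "MetadataRule": metadata_rule,
        # Data must be the Base64 encoding of the raw source bytes.
        "Data": base64.b64encode(raw).decode("ascii"),
    }


body = build_extraction_request(
    "Text",
    b"Item 1\nItem 2\nItem 3",
    {
        "SemanticCellEndpoint": "http://viewdemo:8000/",
        "MinChunkContentLength": 1,
        "MaxChunkContentLength": 512,
        "ShiftSize": 512,
    },
)
print(body["Data"])
```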

curl --location --request PUT 'http://localhost:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/semanticcell' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "DocumentType": "Pdf",
    "MetadataRule": {
        "SemanticCellEndpoint": "http://viewdemo:8000/",
        "MinChunkContentLength": 1,
        "MaxChunkContentLength": 512,
        "ShiftSize": 512
    },
    "Data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqD...."
}'
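
For environments without an SDK, the same request can be prepared with Python's standard library. This is a sketch only: build_put_request is a hypothetical helper, the Authorization value is a placeholder because the source does not show its format, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_put_request(url: str, payload: dict, authorization: str) -> urllib.request.Request:
    """Prepare (but do not send) a PUT request carrying a JSON body.

    Hypothetical helper; the Authorization value is passed through unchanged
    because the source does not document its format.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",  # the endpoint is documented as PUT, not POST
        headers={
            "Content-Type": "application/json",
            "Authorization": authorization,
        },
    )


req = build_put_request(
    "http://localhost:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/semanticcell",
    {"DocumentType": "Pdf", "MetadataRule": {"ShiftSize": 512}, "Data": "<base64-data>"},
    "<authorization-value>",
)
print(req.get_method(), req.full_url)

# To send the request against a running server:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
```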

import { ViewProcessorSdk } from "view-sdk";

const api = new ViewProcessorSdk(
  "http://localhost:8000/", //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const extractSemanticCells = async () => {
  try {
    const response = await api.processSdk.extractSemanticCells({
      DocumentType: "Pdf",
      MetadataRule: {
        SemanticCellEndpoint: "http://viewdemo:8000/",
        MinChunkContentLength: 1,
        MaxChunkContentLength: 512,
        ShiftSize: 512,
      },
      Data: "JVBERi0xLjcNC...",
    });
    console.log(response);
  } catch (error) {
    console.error("Error extracting semantic cells:", error);
  }
};

extractSemanticCells();

import view_sdk
from view_sdk import processor
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.LEXI: 8000},
)

def semantic_cell_extraction():
    # Data holds the Base64-encoded source document (truncated sample)
    result = processor.SemanticCell.extraction(
        DocumentType="pdf",
        Data="JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbikgL1N0cnVjdFRyZWVSb290IDE4IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vTWV0YWRhdGEgODAgMCBSL1ZpZXdlclByZWZlcmVuY2VzIDgxIDAgUj4+DQplbmRvYmoNCjIgMCBvYmoNCjw8L1R5cGUvUGFnZXMvQ291bnQgMS9LaWRzWyAzIDAgUl0gPj4NCmVuZG9iag0KMyAwIG9iag0KPDwvVHlwZS9QYWdlL1BhcmVudCAyIDAgUi9SZXNvdXJjZXM8PC9Gb250PDwvRjEgNSAwIFIvRjIgMTIgMCBSL0YzIDE0IDAgUj",
        MetadataRule={
            "SemanticCellEndpoint": "http://viewdemo:8000/",
            "MinChunkContentLength": 1,
            "MaxChunkContentLength": 512,
            "ShiftSize": 512,
        },
    )
    print(result)

semantic_cell_extraction()

using View.Sdk;
using View.Sdk.Processor;

ViewProcessorSdk sdk = new ViewProcessorSdk(Guid.Parse("<tenant-guid>"), "default", "http://localhost:8000/");
            
SemanticCellExtractionRequest request = new SemanticCellExtractionRequest
{
    DocumentType = "Pdf",
    MetadataRule = new MetadataRule
    {
        SemanticCellEndpoint = "http://viewdemo:8000/",
        MinChunkContentLength = 1,
        MaxChunkContentLength = 512,
        ShiftSize = 512
    },
    Data = "JVBERi0xLjcNC..."
};


SemanticCellResult response = await sdk.SemanticCellExtraction.Process(request);

Response

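Returns a Success flag, a Timestamp object with the start time, end time, and total duration in milliseconds, and a SemanticCells array. Each semantic cell carries its CellType, content hashes, Position, Length, a list of Chunks with their content and offsets, and any Children.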

{
    "Success": true,
    "Timestamp": {
        "Start": "2025-04-30T12:48:19.218168Z",
        "End": "2025-04-30T12:48:20.151467Z",
        "TotalMs": 933.3,
        "Messages": {}
    },
    "SemanticCells": [
        {
            "GUID": "ab5fa437-e784-4cba-be5f-83936bc8259a",
            "CellType": "Table",
            "MD5Hash": "328A482C9D1C0F87B7EF5AA424B0A378",
            "SHA1Hash": "A8E2B4E01E86E7BEF14A3274064C75E268694EDB",
            "SHA256Hash": "03400563FEA89D3458D4304179F2E2690ACDC6E598B23F984B3A99737E9C5A26",
            "Position": 0,
            "Length": 322,
            "Chunks": [
                {
                    "GUID": "2a2fedc0-2bda-470c-8b38-a5964fd19f00",
                    "MD5Hash": "5B43676864350681E36C2D4BC888B6C8",
                    "SHA1Hash": "63A077417A60F0E25DD13526F1A800E8528244B4",
                    "SHA256Hash": "1B08584C2931BEDB8C15A665DAADC8018E2A127B00EF6525DB9970DB426D920E",
                    "Position": 0,
                    "Start": 0,
                    "End": 82,
                    "Length": 82,
                    "Content": "| Column1 | Column2 | Column3 |\n|---|---|---|\n| Row names | Column 1 | Column 2 |\n",
                    "Embeddings": []
                }
            ],
            "Children": []
        },
        {
            "GUID": "03ecc465-257d-4576-9209-4d22d69857a0",
            "CellType": "List",
            "MD5Hash": "CECA959DE15881953B8752F0EBF349E0",
            "SHA1Hash": "1CF7F4E3BF58B6D17D5A12A87643535EC526A3DD",
            "SHA256Hash": "783953D669985425418A39D45B78ACFB426B6A52745E677A2072A6EF0613F9FE",
            "Position": 1,
            "Length": 20,
            "Chunks": [
                {
                    "GUID": "ce3a2c40-5fe9-45b3-8936-d771d4efbbea",
                    "MD5Hash": "414EDCD286451DB29B0F65959E350FC7",
                    "SHA1Hash": "CA0FF26208DD22433102131A63973CCBA334BACB",
                    "SHA256Hash": "47E2C6E55A90203CA7B0EC13DCD30C1A0E4A830B19D62C09CE3B1390B7193E1F",
                    "Position": 0,
                    "Start": 0,
                    "End": 20,
                    "Length": 20,
                    "Content": "Item 1\nItem 2\nItem 3",
                    "Embeddings": []
                }
            ],
            "Children": []
        }
    ]
}
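
Once the response is parsed, the chunk text can be collected by walking SemanticCells. The sketch below assumes that entries in Children have the same shape as top-level cells (the sample response only shows empty Children); iter_chunks and chunk_texts are hypothetical helper names.

```python
from typing import Iterator


def iter_chunks(cells: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (cell_type, depth, chunk) for every chunk, parents before children.

    Assumes Children entries are shaped like top-level semantic cells.
    """
    for cell in cells:
        for chunk in cell.get("Chunks", []):
            yield cell.get("CellType"), depth, chunk
        yield from iter_chunks(cell.get("Children", []), depth + 1)


def chunk_texts(response: dict) -> list:
    """Return the Content of every chunk in a successful response."""
    if not response.get("Success"):
        raise RuntimeError("semantic cell extraction did not succeed")
    cells = response.get("SemanticCells", [])
    return [chunk["Content"] for _, _, chunk in iter_chunks(cells)]


# Abbreviated version of the sample response above (hashes and GUIDs omitted).
sample = {
    "Success": True,
    "SemanticCells": [
        {"CellType": "Table", "Chunks": [{"Content": "| Column1 |"}], "Children": []},
        {"CellType": "List", "Chunks": [{"Content": "Item 1\nItem 2\nItem 3"}], "Children": []},
    ],
}

for cell_type, depth, chunk in iter_chunks(sample["SemanticCells"]):
    print(cell_type, depth, len(chunk["Content"]))
```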

Best Practices

When using semantic cell extraction in the View Processing platform, consider the following recommendations:

Type Detection: Run type detection before semantic cell extraction so that the DocumentType you send matches the asset's actual format
Chunk Configuration: Set MinChunkContentLength, MaxChunkContentLength, and ShiftSize in the MetadataRule to suit your content types and analysis requirements
Content Analysis: Review the cell types and chunks returned for representative documents, and adjust the MetadataRule settings until the cells reflect the structure you expect
Performance Optimization: Monitor extraction time (reported as Timestamp.TotalMs in the response) and tune processing parameters for large-scale content analysis
Content Understanding: Use the extracted semantic cells as input to search and AI-powered analysis
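
The source does not define how the chunk settings interact. As an aid to reasoning about them, the sketch below assumes a common sliding-window interpretation: each chunk spans at most MaxChunkContentLength characters, consecutive chunks start ShiftSize characters apart, and chunks shorter than MinChunkContentLength are dropped. The platform's actual behavior may differ, and window_offsets is a hypothetical helper.

```python
def window_offsets(content_length: int, min_len: int, max_len: int, shift: int) -> list:
    """Return (start, end) offsets under an ASSUMED sliding-window model.

    This models one plausible reading of MinChunkContentLength,
    MaxChunkContentLength, and ShiftSize; it is not the platform's algorithm.
    """
    if min_len < 1 or max_len < min_len or shift < 1:
        raise ValueError("expected 1 <= min_len <= max_len and shift >= 1")
    offsets = []
    start = 0
    while start < content_length:
        end = min(start + max_len, content_length)  # end offset is exclusive
        if end - start >= min_len:
            offsets.append((start, end))
        start += shift
    return offsets


# With ShiftSize equal to MaxChunkContentLength, windows do not overlap.
print(window_offsets(1200, 1, 512, 512))  # [(0, 512), (512, 1024), (1024, 1200)]

# A smaller shift produces overlapping windows under this model.
print(window_offsets(1200, 1, 512, 256))
```

Under this assumption, a ShiftSize smaller than MaxChunkContentLength yields overlapping chunks, which increases the number of chunks produced per cell.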

Next Steps

After successfully extracting semantic cells, you can:

UDR Generation: Generate UDR metadata that incorporates the extracted semantic cells to improve search
Embeddings Generation: Generate vector embeddings from semantic cells for AI-powered content analysis and search
Processing Pipeline: Integrate semantic cell extraction into a processing pipeline
Content Analysis: Use the extracted semantic cells for content analysis and classification
Search Optimization: Use semantic cell information to improve search and content discovery