A guide to semantic cell extraction in the View Processing platform: what it does, how to call it, and what it returns.
Overview
Semantic cell extraction identifies regions within a data asset that contain semantically related content and extracts each region as a single unit called a semantic cell, which later content analysis and processing steps can work with directly.
Semantic cell extraction is accessible via the View Processing API at [http|https]://[hostname]:[port]/v1.0/tenants/[tenant-guid]/processing/semanticcell and supports the document types listed below.
API Endpoints
- PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell - Extract semantic cells from data assets
Supported Document Types
Semantic cell extraction supports the following document types:
- Pptx - Microsoft PowerPoint presentations
- Docx - Microsoft Word documents
- Xlsx - Microsoft Excel spreadsheets
- Text - Plain text documents
- Json - JSON data files
- Xml - XML documents
- Html - HTML web pages
- Parquet - Parquet data files
- Pdf - PDF documents
- DataTable - Tabular data structures
- Csv - Comma-separated value files
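The DocumentType value in a request has to be one of the names in the list above. As a minimal client-side sketch, the names can be kept in a lookup table so that user input such as "pdf" is mapped to the documented spelling before a request is built. The helper name normalize_document_type is hypothetical and not part of any View SDK, and the source does not say whether the API itself accepts other casings.

```python
# Canonical DocumentType spellings, copied from the supported types list above.
SUPPORTED_DOCUMENT_TYPES = (
    "Pptx", "Docx", "Xlsx", "Text", "Json", "Xml",
    "Html", "Parquet", "Pdf", "DataTable", "Csv",
)

# Lookup table from lowercase name to the documented spelling.
_CANONICAL = {name.lower(): name for name in SUPPORTED_DOCUMENT_TYPES}


def normalize_document_type(name: str) -> str:
    """Return the documented spelling of a supported type, or raise ValueError.

    Hypothetical helper for illustration; not part of the View SDK.
    """
    try:
        return _CANONICAL[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported document type: {name!r}") from None


print(normalize_document_type("pdf"))        # Pdf
print(normalize_document_type("datatable"))  # DataTable
```

This check only guards against misspelled type names. It does not replace type detection, which inspects the asset itself.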
Extract Semantic Cells
Extracts semantic cells from a data asset using PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell. Type detection must be performed first to determine the appropriate processing approach for the data asset.
Request Parameters
Required Parameters
- DocumentType (enum, Body, Required): Document type from the supported types list
- Data (string, Body, Required): Base64-encoded source data content
- MetadataRule (object, Body, Required): Metadata rule configuration for semantic cell extraction
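Because Data must be Base64 text rather than raw bytes, it can help to see how the three required parameters fit together. The following is a minimal sketch using only the Python standard library. The function name build_request_body is an assumption made for illustration, and the rule values are copied from the request examples in this section, not recommended defaults.

```python
import base64
import json


def build_request_body(raw: bytes, document_type: str, metadata_rule: dict) -> str:
    """Serialize the three required body parameters as a JSON string.

    Hypothetical helper for illustration; not part of the View SDK.
    """
    body = {
        "DocumentType": document_type,
        "MetadataRule": metadata_rule,
        # Data must be Base64 text, not raw bytes.
        "Data": base64.b64encode(raw).decode("ascii"),
    }
    return json.dumps(body)


# Rule values mirror the request examples in this section.
rule = {
    "SemanticCellEndpoint": "http://viewdemo:8000/",
    "MinChunkContentLength": 1,
    "MaxChunkContentLength": 512,
    "ShiftSize": 512,
}

# A real call would pass the full file contents, for example
# pathlib.Path("report.pdf").read_bytes() (the file name is a placeholder).
payload = build_request_body(b"%PDF-1.7", "Pdf", rule)
print(payload)
```

The resulting string is the JSON body that the request examples below send to the endpoint.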
curl --location --request PUT 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/semanticcell' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "DocumentType": "Pdf",
    "MetadataRule": {
        "SemanticCellEndpoint": "http://viewdemo:8000/",
        "MinChunkContentLength": 1,
        "MaxChunkContentLength": 512,
        "ShiftSize": 512
    },
    "Data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqD...."
}'
import { ViewProcessorSdk } from "view-sdk";
const api = new ViewProcessorSdk(
  "http://localhost:8000/", // endpoint
  "<tenant-guid>", // tenant ID
  "default" // access key
);
const extractSemanticCells = async () => {
  try {
    const response = await api.processSdk.extractSemanticCells({
      DocumentType: "Pdf",
      MetadataRule: {
        SemanticCellEndpoint: "http://viewdemo:8000/",
        MinChunkContentLength: 1,
        MaxChunkContentLength: 512,
        ShiftSize: 512,
      },
      Data: "JVBERi0xLjcNC...",
    });
    console.log(response);
  } catch (error) {
    console.error("Error extracting semantic cells:", error);
  }
};
extractSemanticCells();
import view_sdk
from view_sdk import processor
from view_sdk.sdk_configuration import Service
sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost",
    tenant_guid="default",
    service_ports={Service.LEXI: 8000},
)
def semantic_cell_extraction():
    # DocumentType uses the spelling from the supported types list
    result = processor.SemanticCell.extraction(
        DocumentType="Pdf",
        Data="JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbikgL1N0cnVjdFRyZWVSb290IDE4IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vTWV0YWRhdGEgODAgMCBSL1ZpZXdlclByZWZlcmVuY2VzIDgxIDAgUj4+DQplbmRvYmoNCjIgMCBvYmoNCjw8L1R5cGUvUGFnZXMvQ291bnQgMS9LaWRzWyAzIDAgUl0gPj4NCmVuZG9iag0KMyAwIG9iag0KPDwvVHlwZS9QYWdlL1BhcmVudCAyIDAgUi9SZXNvdXJjZXM8PC9Gb250PDwvRjEgNSAwIFIvRjIgMTIgMCBSL0YzIDE0IDAgUj",
        MetadataRule={
            "SemanticCellEndpoint": "http://viewdemo:8000/",
            "MinChunkContentLength": 1,
            "MaxChunkContentLength": 512,
            "ShiftSize": 512
        })
    print(result)
semantic_cell_extraction()
using View.Sdk;
using View.Sdk.Processor;
ViewProcessorSdk sdk = new ViewProcessorSdk(
    Guid.Parse("<tenant-guid>"),
    "default",
    "http://localhost:8000/");
SemanticCellExtractionRequest request = new SemanticCellExtractionRequest
{
    DocumentType = "Pdf",
    MetadataRule = new MetadataRule
    {
        SemanticCellEndpoint = "http://viewdemo:8000/",
        MinChunkContentLength = 1,
        MaxChunkContentLength = 512,
        ShiftSize = 512
    },
    Data = "JVBERi0xLjcNC..."
};
SemanticCellResult response = await sdk.SemanticCellExtraction.Process(request);
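Where none of the SDKs is available, the same call can be made over plain HTTP. The sketch below builds the PUT request with Python's standard urllib.request module. The function names are hypothetical. The Authorization value is redacted in the source, so it is passed in as an opaque string and no particular scheme is assumed.

```python
import json
import urllib.request


def make_extraction_request(base_url: str, tenant_guid: str,
                            authorization: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the PUT request for semantic cell extraction."""
    url = f"{base_url.rstrip('/')}/v1.0/tenants/{tenant_guid}/processing/semanticcell"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the caller supplies the complete header value.
            "Authorization": authorization,
        },
        method="PUT",
    )


def send_extraction_request(request: urllib.request.Request,
                            timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network.
request = make_extraction_request(
    "http://localhost:8000/",
    "00000000-0000-0000-0000-000000000000",
    "<authorization-value>",  # placeholder, replace with a real credential
    {"DocumentType": "Pdf", "MetadataRule": {}, "Data": ""},
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the URL, method, and headers easy to inspect before anything is transmitted. Calling send_extraction_request(request) performs the actual call and raises urllib.error.HTTPError on a non-success status.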
Response
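Returns a Success flag, a Timestamp object with the start time, end time, and total duration in milliseconds, and a SemanticCells array. Each semantic cell carries a CellType, content hashes, a Position and Length, a Chunks array holding the extracted content, and a Children array.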
{
    "Success": true,
    "Timestamp": {
        "Start": "2025-04-30T12:48:19.218168Z",
        "End": "2025-04-30T12:48:20.151467Z",
        "TotalMs": 933.3,
        "Messages": {}
    },
    "SemanticCells": [
        {
            "GUID": "ab5fa437-e784-4cba-be5f-83936bc8259a",
            "CellType": "Table",
            "MD5Hash": "328A482C9D1C0F87B7EF5AA424B0A378",
            "SHA1Hash": "A8E2B4E01E86E7BEF14A3274064C75E268694EDB",
            "SHA256Hash": "03400563FEA89D3458D4304179F2E2690ACDC6E598B23F984B3A99737E9C5A26",
            "Position": 0,
            "Length": 322,
            "Chunks": [
                {
                    "GUID": "2a2fedc0-2bda-470c-8b38-a5964fd19f00",
                    "MD5Hash": "5B43676864350681E36C2D4BC888B6C8",
                    "SHA1Hash": "63A077417A60F0E25DD13526F1A800E8528244B4",
                    "SHA256Hash": "1B08584C2931BEDB8C15A665DAADC8018E2A127B00EF6525DB9970DB426D920E",
                    "Position": 0,
                    "Start": 0,
                    "End": 82,
                    "Length": 82,
                    "Content": "| Column1 | Column2 | Column3 |\n|---|---|---|\n| Row names | Column 1 | Column 2 |\n",
                    "Embeddings": []
                }
            ],
            "Children": []
        },
        {
            "GUID": "03ecc465-257d-4576-9209-4d22d69857a0",
            "CellType": "List",
            "MD5Hash": "CECA959DE15881953B8752F0EBF349E0",
            "SHA1Hash": "1CF7F4E3BF58B6D17D5A12A87643535EC526A3DD",
            "SHA256Hash": "783953D669985425418A39D45B78ACFB426B6A52745E677A2072A6EF0613F9FE",
            "Position": 1,
            "Length": 20,
            "Chunks": [
                {
                    "GUID": "ce3a2c40-5fe9-45b3-8936-d771d4efbbea",
                    "MD5Hash": "414EDCD286451DB29B0F65959E350FC7",
                    "SHA1Hash": "CA0FF26208DD22433102131A63973CCBA334BACB",
                    "SHA256Hash": "47E2C6E55A90203CA7B0EC13DCD30C1A0E4A830B19D62C09CE3B1390B7193E1F",
                    "Position": 0,
                    "Start": 0,
                    "End": 20,
                    "Length": 20,
                    "Content": "Item 1\nItem 2\nItem 3",
                    "Embeddings": []
                }
            ],
            "Children": []
        }
    ]
}
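Once the response has been parsed, the extracted text sits in the Content field of each chunk. The sketch below walks the SemanticCells array and collects chunk content together with the cell type. It assumes that Children holds nested cells of the same shape; the sample response above only shows empty Children arrays, so that part is a guess. Both function names are hypothetical.

```python
from typing import Iterator


def iter_cells(cells: list, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, cell) for every cell, descending into Children.

    Assumption: Children contains cells with the same fields as their parent.
    """
    for cell in cells:
        yield depth, cell
        yield from iter_cells(cell.get("Children") or [], depth + 1)


def chunk_texts(response: dict) -> list:
    """Collect (CellType, Content) for every chunk in a successful response."""
    if not response.get("Success"):
        raise RuntimeError("Semantic cell extraction did not report success")
    return [
        (cell.get("CellType"), chunk.get("Content", ""))
        for _, cell in iter_cells(response.get("SemanticCells") or [])
        for chunk in cell.get("Chunks") or []
    ]


# Trimmed copy of the sample response above (hashes and GUIDs omitted).
sample = {
    "Success": True,
    "SemanticCells": [
        {
            "CellType": "Table",
            "Chunks": [{"Content": "| Column1 | Column2 | Column3 |\n"}],
            "Children": [],
        },
        {
            "CellType": "List",
            "Chunks": [{"Content": "Item 1\nItem 2\nItem 3"}],
            "Children": [],
        },
    ],
}

for cell_type, content in chunk_texts(sample):
    print(cell_type, repr(content))
```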
Best Practices
When using semantic cell extraction in the View Processing platform, consider the following recommendations:
- Type Detection: Always perform type detection before semantic cell extraction so that the DocumentType you submit matches the asset's actual format
- Chunk Configuration: Set MinChunkContentLength, MaxChunkContentLength, and ShiftSize in the MetadataRule to suit your content types and analysis requirements
- Content Analysis: Use comprehensive content analysis settings to maximize semantic cell extraction and content understanding
- Performance Optimization: Monitor extraction time (reported as Timestamp.TotalMs in the response) and adjust processing parameters for large-scale content analysis
- Content Understanding: Leverage extracted semantic cells for enhanced content understanding, search optimization, and AI-powered analysis
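To build intuition for the chunk settings, the sketch below shows one plausible reading of them: a character-based sliding window in which each chunk is at most MaxChunkContentLength long, the window advances by ShiftSize, and chunks shorter than MinChunkContentLength are dropped. This is an assumption made for illustration only. The source does not describe the platform's chunking algorithm, and the function sliding_chunks is hypothetical.

```python
def sliding_chunks(text: str, min_len: int, max_len: int, shift: int) -> list:
    """Split text into (start, end, content) windows.

    Illustrative only: assumes character-based windows of at most max_len
    that advance by shift; windows shorter than min_len are dropped.
    """
    if min_len < 0 or max_len < 1 or min_len > max_len or shift < 1:
        raise ValueError("expected 0 <= min_len <= max_len, max_len >= 1, shift >= 1")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end - start >= min_len:
            chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the window has reached the end of the text
        start += shift
    return chunks


# shift == max_len: windows tile the text without overlap.
print(sliding_chunks("abcdefghij", 1, 4, 4))
# shift < max_len: consecutive windows overlap by max_len - shift characters.
print(sliding_chunks("abcdefghij", 1, 4, 2))
```

Under this reading, the values used in the request examples (MaxChunkContentLength and ShiftSize both 512) would produce non-overlapping chunks, while a smaller ShiftSize would trade more chunks for overlapping context.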
Next Steps
After successfully extracting semantic cells, you can:
- UDR Generation: Generate UDR metadata incorporating extracted semantic cells for enhanced search capabilities
- Embeddings Generation: Generate vector embeddings from semantic cells for AI-powered content analysis and search
- Processing Pipeline: Integrate semantic cell extraction into comprehensive processing pipeline workflows
- Content Analysis: Use extracted semantic cells for advanced content analysis, classification, and understanding
- Search Optimization: Optimize search capabilities using semantic cell information for enhanced content discovery