This page provides an overview of APIs related to semantic cell extraction.
View semantic cell extraction is the process by which source content is analyzed to identify regions within a data asset that have a high probability of containing semantically related content. Each such region, for instance a paragraph within a Word document, is extracted as a single unit called a semantic cell.
Semantic cell extraction is accessible by calling PUT /v1.0/tenants/[tenant-guid]/processing/semanticcell on the semantic cell extractor server, which by default listens on port 8000. Type detection must be performed before attempting to extract semantic cells.
The semantic cell extraction request must include the following properties:
DocumentType (enum): one of Pptx, Docx, Xlsx, Text, Json, Xml, Html, Parquet, Pdf, DataTable, Csv
MetadataRule (object): the metadata rule object, with the following properties:
  MaxChunkContentLength (int): the maximum chunk content length
  MinChunkContentLength (int): the minimum chunk content length
  SemanticCellEndpoint (string): the semantic cell endpoint
  ShiftSize (int): the number of bytes to shift when advancing the sliding window for chunk detection
Data (string): the source data, base64 encoded
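To make the interaction between MaxChunkContentLength, MinChunkContentLength, and ShiftSize more concrete, here is a rough illustration of a generic sliding window over bytes. It is not the extractor's actual algorithm, which this page does not describe; the function name sliding_chunks and the rule of discarding windows shorter than the minimum are assumptions made for the sketch.

```python
def sliding_chunks(data: bytes, max_len: int, shift: int, min_len: int = 1) -> list[tuple[int, bytes]]:
    """Return (start_offset, window) pairs from a generic sliding window.

    Illustrative only: the window is at most max_len bytes wide and advances
    by shift bytes. Windows shorter than min_len are dropped (an assumption).
    """
    if max_len <= 0 or shift <= 0:
        raise ValueError("max_len and shift must be positive")
    chunks = []
    for start in range(0, len(data), shift):
        window = data[start:start + max_len]
        if len(window) >= min_len:
            chunks.append((start, window))
    return chunks


# shift == max_len gives back-to-back windows; shift < max_len gives overlap.
for start, window in sliding_chunks(b"abcdefghij", max_len=4, shift=2):
    print(start, window)
```

In this sketch, setting ShiftSize equal to MaxChunkContentLength (as in the example request, where both are 512) yields windows that do not overlap, while a smaller shift makes consecutive windows share content.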
An example request is as follows:
curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/semanticcell' \
--header 'Content-Type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "DocumentType": "Pdf",
    "MetadataRule": {
        "SemanticCellEndpoint": "http://viewdemo:8000/",
        "MinChunkContentLength": 1,
        "MaxChunkContentLength": 512,
        "ShiftSize": 512
    },
    "Data": "JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqD...."
}'
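The Data value in the request is the base64 encoding of the raw file bytes. A minimal sketch of assembling the request body with only the Python standard library follows; the helper name build_semantic_cell_request is hypothetical and not part of any View SDK, and the default values simply mirror the example request.

```python
import base64
import json


def build_semantic_cell_request(
    raw: bytes,
    document_type: str,
    endpoint: str,
    min_len: int = 1,
    max_len: int = 512,
    shift: int = 512,
) -> dict:
    """Assemble the JSON body described in the property list (hypothetical helper)."""
    return {
        "DocumentType": document_type,
        "MetadataRule": {
            "SemanticCellEndpoint": endpoint,
            "MinChunkContentLength": min_len,
            "MaxChunkContentLength": max_len,
            "ShiftSize": shift,
        },
        # The service expects the source data as a base64 string.
        "Data": base64.b64encode(raw).decode("ascii"),
    }


# Placeholder bytes standing in for the contents of a real PDF file.
body = build_semantic_cell_request(b"%PDF-1.7\r\n", "Pdf", "http://viewdemo:8000/")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with json.dumps and sent as the request body by whichever HTTP client is in use.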
import { ViewProcessorSdk } from "view-sdk";

const api = new ViewProcessorSdk(
  "http://localhost:8000/", // endpoint
  "<tenant-guid>", // tenant ID
  "default" // access key
);

const extractSemanticCells = async () => {
  try {
    const response = await api.processSdk.extractSemanticCells({
      DocumentType: "Pdf",
      MetadataRule: {
        SemanticCellEndpoint: "http://viewdemo:8000/",
        MinChunkContentLength: 1,
        MaxChunkContentLength: 512,
        ShiftSize: 512,
      },
      Data: "JVBERi0xLjcNC...",
    });
    console.log(response);
  } catch (error) {
    console.error("Error extracting semantic cells:", error);
  }
};

extractSemanticCells();
import view_sdk
from view_sdk import processor

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def semanticCellExtraction():
    # Data is the base64-encoded source document
    result = processor.SemanticCell.extraction(
        DocumentType='pdf',
        Data="JVBERi0xLjcNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbikgL1N0cnVjdFRyZWVSb290IDE4IDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4vTWV0YWRhdGEgODAgMCBSL1ZpZXdlclByZWZlcmVuY2VzIDgxIDAgUj4+DQplbmRvYmoNCjIgMCBvYmoNCjw8L1R5cGUvUGFnZXMvQ291bnQgMS9LaWRzWyAzIDAgUl0gPj4NCmVuZG9iag0KMyAwIG9iag0KPDwvVHlwZS9QYWdlL1BhcmVudCAyIDAgUi9SZXNvdXJjZXM8PC9Gb250PDwvRjEgNSAwIFIvRjIgMTIgMCBSL0YzIDE0IDAgUj",
        MetadataRule={
            "SemanticCellEndpoint": "http://viewdemo:8000/",
            "MinChunkContentLength": 1,
            "MaxChunkContentLength": 512,
            "ShiftSize": 512
        })
    print(result)

semanticCellExtraction()
The response includes a Success boolean indicating whether the request succeeded, along with a Timestamp object describing the runtime. The SemanticCells property contains an array of the semantic cells found within the supplied payload. An example response is as follows:
{
  "Success": true,
  "Timestamp": {
    "Start": "2025-04-30T12:48:19.218168Z",
    "End": "2025-04-30T12:48:20.151467Z",
    "TotalMs": 933.3,
    "Messages": {}
  },
  "SemanticCells": [
    {
      "GUID": "ab5fa437-e784-4cba-be5f-83936bc8259a",
      "CellType": "Table",
      "MD5Hash": "328A482C9D1C0F87B7EF5AA424B0A378",
      "SHA1Hash": "A8E2B4E01E86E7BEF14A3274064C75E268694EDB",
      "SHA256Hash": "03400563FEA89D3458D4304179F2E2690ACDC6E598B23F984B3A99737E9C5A26",
      "Position": 0,
      "Length": 322,
      "Chunks": [
        {
          "GUID": "2a2fedc0-2bda-470c-8b38-a5964fd19f00",
          "MD5Hash": "5B43676864350681E36C2D4BC888B6C8",
          "SHA1Hash": "63A077417A60F0E25DD13526F1A800E8528244B4",
          "SHA256Hash": "1B08584C2931BEDB8C15A665DAADC8018E2A127B00EF6525DB9970DB426D920E",
          "Position": 0,
          "Start": 0,
          "End": 82,
          "Length": 82,
          "Content": "| Column1 | Column2 | Column3 |\n|---|---|---|\n| Row names | Column 1 | Column 2 |\n",
          "Embeddings": []
        }
      ],
      "Children": []
    },
    {
      "GUID": "03ecc465-257d-4576-9209-4d22d69857a0",
      "CellType": "List",
      "MD5Hash": "CECA959DE15881953B8752F0EBF349E0",
      "SHA1Hash": "1CF7F4E3BF58B6D17D5A12A87643535EC526A3DD",
      "SHA256Hash": "783953D669985425418A39D45B78ACFB426B6A52745E677A2072A6EF0613F9FE",
      "Position": 1,
      "Length": 20,
      "Chunks": [
        {
          "GUID": "ce3a2c40-5fe9-45b3-8936-d771d4efbbea",
          "MD5Hash": "414EDCD286451DB29B0F65959E350FC7",
          "SHA1Hash": "CA0FF26208DD22433102131A63973CCBA334BACB",
          "SHA256Hash": "47E2C6E55A90203CA7B0EC13DCD30C1A0E4A830B19D62C09CE3B1390B7193E1F",
          "Position": 0,
          "Start": 0,
          "End": 20,
          "Length": 20,
          "Content": "Item 1\nItem 2\nItem 3",
          "Embeddings": []
        }
      ],
      "Children": []
    }
  ]
}
}
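Once the response has been parsed as JSON, the chunk contents can be collected by walking the SemanticCells array. The sketch below is one possible way to do that in Python. The function name iter_chunk_contents is hypothetical, and it assumes that Children holds nested cells with the same shape as top-level cells, which the example response (where Children is empty) does not confirm.

```python
from typing import Any, Iterator


def iter_chunk_contents(cells: list[dict[str, Any]]) -> Iterator[tuple[str, str]]:
    """Yield (CellType, Content) for every chunk, visiting cells depth first."""
    for cell in cells:
        for chunk in cell.get("Chunks") or []:
            yield cell.get("CellType", ""), chunk.get("Content", "")
        # Assumption: Children contains nested cells of the same shape.
        yield from iter_chunk_contents(cell.get("Children") or [])


# Trimmed copy of the example response (GUIDs, hashes, and offsets omitted).
response = {
    "Success": True,
    "SemanticCells": [
        {
            "CellType": "Table",
            "Chunks": [{"Content": "| Column1 | Column2 | Column3 |\n|---|---|---|\n| Row names | Column 1 | Column 2 |\n"}],
            "Children": [],
        },
        {
            "CellType": "List",
            "Chunks": [{"Content": "Item 1\nItem 2\nItem 3"}],
            "Children": [],
        },
    ],
}

# Only read the cells when the service reports success.
if response.get("Success"):
    for cell_type, content in iter_chunk_contents(response["SemanticCells"]):
        print(f"[{cell_type}] {content!r}")
```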