This page provides an overview of APIs related to semantic cell extraction.
View semantic cell extraction is a process by which source content is analyzed to identify regions within the data asset that have a high probability of containing semantically related content. Each such region, for instance a paragraph within a Word document, is extracted as a single unit called a semantic cell.
Semantic cell extraction is accessible by calling PUT /v1.0/document on the semantic cell extractor server, which by default listens on port 8341. Type detection must be performed before attempting to extract semantic cells.
The semantic cell extraction request must include the following properties:
DocumentType (enum): one of Pptx, Docx, Xlsx, Text, Json, Xml, Html, Parquet, Pdf, DataTable, Csv
MaxChunkContentLength (int): the maximum chunk content length
ShiftSize (int): the number of bytes to shift when advancing the sliding window for chunk detection
Data (string): the source data, base64 encoded
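To make the relationship between MaxChunkContentLength and ShiftSize concrete, the following is a minimal sketch of a generic byte-oriented sliding window. It is an illustration only: the function name sliding_windows is hypothetical, and the extractor's actual chunking logic, including whether lengths are measured in bytes or characters, is not described on this page.

```python
def sliding_windows(data: bytes, max_len: int, shift: int) -> list:
    """Return (start_offset, window) pairs covering data.

    Illustrative assumption: each window holds at most max_len bytes and
    the window start advances by shift bytes. This is not the server's
    documented algorithm.
    """
    if max_len <= 0 or shift <= 0:
        raise ValueError("max_len and shift must be positive")
    windows = []
    start = 0
    while start < len(data):
        windows.append((start, data[start:start + max_len]))
        # Stop once the current window reaches the end of the data.
        if start + max_len >= len(data):
            break
        start += shift
    return windows
```

Under this sketch, a shift equal to the maximum length (as in a request with both set to 512) produces windows that do not overlap, while a smaller shift produces overlapping windows.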
An example request is as follows:
{
  "DocumentType": "Text",
  "MaxChunkContentLength": 512,
  "ShiftSize": 512,
  "Data": "VGl0bGUgb2YgQm9vawoKVGhlIGF..."
}
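A request like the one above could be built and sent from Python roughly as follows. This is a sketch under stated assumptions: the host name, the Content-Type header, and the helper names build_extraction_request and extract_semantic_cells are illustrative and not part of the documented API, and any authentication the server may require is not covered.

```python
import base64
import json
import urllib.request

# Assumption: the extractor runs on the local machine; 8341 is the default port.
BASE_URL = "http://localhost:8341"


def build_extraction_request(data: bytes,
                             document_type: str = "Text",
                             max_chunk_content_length: int = 512,
                             shift_size: int = 512) -> dict:
    """Build the JSON body for PUT /v1.0/document."""
    return {
        "DocumentType": document_type,
        "MaxChunkContentLength": max_chunk_content_length,
        "ShiftSize": shift_size,
        # The source data must be base64 encoded.
        "Data": base64.b64encode(data).decode("ascii"),
    }


def extract_semantic_cells(body: dict, base_url: str = BASE_URL) -> dict:
    """Send the request and return the decoded JSON response."""
    request = urllib.request.Request(
        url=f"{base_url}/v1.0/document",
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server expects a JSON content type.
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling extract_semantic_cells(build_extraction_request(b"Title of Book")) would then submit the text for extraction, provided a server is reachable at the assumed address.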
The response includes a Success bool indicating whether or not the request succeeded, along with a Timestamp object recording the start time, end time, and total runtime of the request. The SemanticCells property contains an array of the semantic cells found within the supplied payload. An example response is as follows:
{
  "Success": true,
  "Timestamp": {
    "Start": "2024-10-28T02:48:54.244704Z",
    "End": "2024-10-28T02:48:54.272924Z",
    "TotalMs": 28.22,
    "Messages": {}
  },
  "SemanticCells": [
    {
      "GUID": "760da08a-a58a-49d9-aa67-a4e46dbdeeaa",
      "CellType": "Text",
      "MD5Hash": "07CB0C1EFFA3EEC68D3C0EE4C76CF8CA",
      "SHA1Hash": "A36CF35F5ECD8741F36C1D8F693A3372357800F5",
      "SHA256Hash": "9627A45F9F768F906D621461F9440DC05005ADF43915ADE7FEF55AE7CA15755C",
      "Position": 0,
      "Length": 13,
      "Chunks": [
        {
          "GUID": "4145f030-ca80-450c-9efb-e4318e51e40d",
          "MD5Hash": "07CB0C1EFFA3EEC68D3C0EE4C76CF8CA",
          "SHA1Hash": "A36CF35F5ECD8741F36C1D8F693A3372357800F5",
          "SHA256Hash": "9627A45F9F768F906D621461F9440DC05005ADF43915ADE7FEF55AE7CA15755C",
          "Position": 0,
          "Start": 0,
          "End": 13,
          "Length": 13,
          "Content": "Title of Book",
          "Embeddings": []
        },
        ...
      ],
      "Children": []
    },
    { ... }
  ]
}
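Once the response has been parsed into a dictionary, the chunk text can be collected by walking the SemanticCells array. The sketch below assumes that Children holds nested semantic cells with the same shape as their parent, which the example response suggests but does not confirm; the helper names iter_chunks and chunk_contents are hypothetical.

```python
from typing import Iterator


def iter_chunks(cells: list) -> Iterator[dict]:
    """Yield every chunk in a list of semantic cells, depth first."""
    for cell in cells:
        yield from cell.get("Chunks", [])
        # Assumption: Children contains nested semantic cells of the same shape.
        yield from iter_chunks(cell.get("Children", []))


def chunk_contents(response: dict) -> list:
    """Return the Content of every chunk in a successful response."""
    if not response.get("Success"):
        raise RuntimeError("semantic cell extraction did not succeed")
    return [chunk["Content"]
            for chunk in iter_chunks(response.get("SemanticCells", []))]
```

Checking Success before reading SemanticCells keeps a failed request from being mistaken for a document that simply contains no cells.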