This page provides an overview of APIs related to generating metadata.
View natively generated metadata from source data assets in the form of Universal Data Representation (UDR) documents. Lexi uses UDR as its foundational data representation, which enables both simple and advanced search capabilities.
Metadata within UDR documents includes:
- The list of key terms identified within the data asset and their frequency
- The full list of terms identified within the data asset
- The inferred schema, when the supplied content type implies a schema is present
- When a schema is present, a flattened representation of the document to simplify queries
- The postings, i.e. an inverted index over the document, including terms, frequencies, and absolute and relative positions of each
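To make the postings entry more concrete, the following is a minimal Python sketch of an inverted index that records each term's count and absolute positions. It is not Lexi's implementation: the `build_postings` helper, the regex tokenizer, and the reading of an absolute position as a zero-based token index are all assumptions made for illustration. Only the field names `Term`, `Count`, and `AbsolutePositions` come from the sample response on this page. Relative positions are left out because their definition is not given here.

```python
import re
from collections import defaultdict


def build_postings(text, case_insensitive=True):
    """Hypothetical helper: build postings (term, count, positions) for a text.

    Assumption: an absolute position is the zero-based index of the token
    in the token stream. The real pipeline may define it differently.
    """
    if case_insensitive:
        text = text.lower()
    tokens = re.findall(r"\w+", text)  # naive word tokenizer (assumption)

    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    return [
        {"Term": term, "Count": len(where), "AbsolutePositions": where}
        for term, where in positions.items()
    ]


postings = build_postings("The root user owns the root file")
by_term = {posting["Term"]: posting for posting in postings}
print(by_term["root"])
```

Keeping positions alongside counts is what allows phrase and proximity queries to be answered from the index alone, without rereading the source document.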
Within the processing pipeline, semantic cell extraction is handled as a separate process, and the results are appended to UDR documents.
To generate a UDR document for a given data asset, you must first determine its data type through type detection. Then, send a PUT request to /v1.0/document on the document processor server, which listens on port 8321 by default.
{
"GUID": "00000000-0000-0000-0000-000000000000",
"Key": "testfile.txt",
"ContentType": "text/plain",
"Type": "Pdf",
"IncludeFlattened": true,
"CaseInsensitive": true,
"TopTerms": 10,
"AdditionalData": "The body below is simple sample text, base64 encoded, taken from https://en.wikipedia.org/wiki/Artificial_intelligence.",
"Metadata": {
"foo": "bar"
},
"Data": "QXJ0aWZpY2lhbCBpbnRlbGxpZ2VuY2UgKEFJKSwgaW4gaXRzIGJyb2FkZXN0IHNlbnNlLCBpcyBpbnRlbGxpZ2VuY2UgZXhoaWJpdGVkIGJ5IG1hY2hpbmVzLCBwYXJ0aWN1bGFybHkgY29tcHV0ZXIgc3lzdGVtcy4gSXQgaXMgYSBmaWVsZCBvZiByZXNlYXJjaCBpbiBjb21wdXRlciBzY2llbmNlIHRoYXQgZGV2ZWxvcHMgYW5kIHN0dWRpZXMgbWV0aG9kcyBhbmQgc29mdHdhcmUgdGhhdCBlbmFibGUgbWFjaGluZXMgdG8gcGVyY2VpdmUgdGhlaXIgZW52aXJvbm1lbnQgYW5kIHVzZSBsZWFybmluZyBhbmQgaW50ZWxsaWdlbmNlIHRvIHRha2UgYWN0aW9ucyB0aGF0IG1heGltaXplIHRoZWlyIGNoYW5jZXMgb2YgYWNoaWV2aW5nIGRlZmluZWQgZ29hbHMuWzFdIFN1Y2ggbWFjaGluZXMgbWF5IGJlIGNhbGxlZCBBSXMuCgpTb21lIGhpZ2gtcHJvZmlsZSBhcHBsaWNhdGlvbnMgb2YgQUkgaW5jbHVkZSBhZHZhbmNlZCB3ZWIgc2VhcmNoIGVuZ2luZXMgKGUuZy4sIEdvb2dsZSBTZWFyY2gpOyByZWNvbW1lbmRhdGlvbiBzeXN0ZW1zICh1c2VkIGJ5IFlvdVR1YmUsIEFtYXpvbiwgYW5kIE5ldGZsaXgpOyBpbnRlcmFjdGluZyB2aWEgaHVtYW4gc3BlZWNoIChlLmcuLCBHb29nbGUgQXNzaXN0YW50LCBTaXJpLCBhbmQgQWxleGEpOyBhdXRvbm9tb3VzIHZlaGljbGVzIChlLmcuLCBXYXltbyk7IGdlbmVyYXRpdmUgYW5kIGNyZWF0aXZlIHRvb2xzIChlLmcuLCBDaGF0R1BULCBBcHBsZSBJbnRlbGxpZ2VuY2UsIGFuZCBBSSBhcnQpOyBhbmQgc3VwZXJodW1hbiBwbGF5IGFuZCBhbmFseXNpcyBpbiBzdHJhdGVneSBnYW1lcyAoZS5nLiwgY2hlc3MgYW5kIEdvKS5bMl0gSG93ZXZlciwgbWFueSBBSSBhcHBsaWNhdGlvbnMgYXJlIG5vdCBwZXJjZWl2ZWQgYXMgQUk6ICJBIGxvdCBvZiBjdXR0aW5nIGVkZ2UgQUkgaGFzIGZpbHRlcmVkIGludG8gZ2VuZXJhbCBhcHBsaWNhdGlvbnMsIG9mdGVuIHdpdGhvdXQgYmVpbmcgY2FsbGVkIEFJIGJlY2F1c2Ugb25jZSBzb21ldGhpbmcgYmVjb21lcyB1c2VmdWwgZW5vdWdoIGFuZCBjb21tb24gZW5vdWdoIGl0J3Mgbm90IGxhYmVsZWQgQUkgYW55bW9yZS4iWzNdWzRd"
}
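The request above could be assembled from Python roughly as follows. This is a sketch under stated assumptions: the host `localhost`, the `Content-Type: application/json` header, and the `build_document_request` helper are illustrative and not confirmed by this page. Only the method, the path, the default port, and the body field names are taken from the text above. Any authentication your deployment requires is not shown.

```python
import base64
import json
import urllib.request


def build_document_request(text, key, content_type, doc_type,
                           base_url="http://localhost:8321"):
    """Hypothetical helper: build (but do not send) a PUT /v1.0/document request.

    base_url is an assumption; 8321 is the documented default port.
    """
    body = {
        "GUID": "00000000-0000-0000-0000-000000000000",
        "Key": key,
        "ContentType": content_type,
        "Type": doc_type,
        "IncludeFlattened": True,
        "CaseInsensitive": True,
        "TopTerms": 10,
        # The Data field carries the asset content, base64 encoded.
        "Data": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    }
    return urllib.request.Request(
        url=base_url + "/v1.0/document",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="PUT",
    )


# Key, ContentType, and Type values are copied from the sample request.
request = build_document_request(
    "Artificial intelligence (AI) is intelligence exhibited by machines.",
    key="testfile.txt",
    content_type="text/plain",
    doc_type="Pdf",
)
print(request.get_method(), request.full_url)
```

To send it, pass the object to `urllib.request.urlopen(request)` and decode the response body with `json.load`. Building the request separately from sending it keeps the payload easy to inspect before it reaches the server.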
The response body contains a fully populated UDR document.
{
  "GUID": "00000000-0000-0000-0000-000000000000",
  "Success": true,
  "Timestamp": {
    "Start": "2024-10-28T02:51:13.126562Z",
    "End": "2024-10-28T02:51:15.934068Z",
    "TotalMs": 2807.51,
    "Messages": {}
  },
  "AdditionalData": "The body below is simple sample text, base64 encoded, taken from https://en.wikipedia.org/wiki/Artificial_intelligence.",
  "Metadata": {
    "foo": "bar"
  },
  "Type": "Pdf",
  "Terms": [
    "running",
    "pip",
    "root",
    "user",
    ...
  ],
  "TopTerms": {
    "root": 4,
    "file": 4,
    "line": 4,
    ...
  },
  "Schema": {
    "Type": "Pdf",
    "Schema": {},
    "Metadata": {},
    "Flattened": []
  },
  "Postings": [
    {
      "Term": "running",
      "Count": 1,
      "AbsolutePositions": [
        2
      ]
    },
    { ... }
  ],
  "SemanticCells": []
}
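Once the response has been parsed into a dictionary, the fields shown above can be read directly. The sketch below is illustrative only: `top_terms` and `positions_of` are hypothetical helpers, and the `udr` dictionary is a hand-trimmed subset of the sample response (elided entries are left out rather than filled in), not real server output.

```python
def top_terms(udr, limit=3):
    """Hypothetical helper: return (term, frequency) pairs, most frequent first."""
    if not udr.get("Success"):
        raise ValueError("document processing did not succeed")
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(udr.get("TopTerms", {}).items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


def positions_of(udr, term):
    """Hypothetical helper: return a term's absolute positions, or [] if absent."""
    for posting in udr.get("Postings", []):
        if posting["Term"] == term:
            return posting["AbsolutePositions"]
    return []


# Trimmed subset of the sample response above (assumption: parsed JSON dict).
udr = {
    "Success": True,
    "TopTerms": {"root": 4, "file": 4, "line": 4},
    "Postings": [{"Term": "running", "Count": 1, "AbsolutePositions": [2]}],
}

print(top_terms(udr))
print(positions_of(udr, "running"))
```

Checking `Success` before reading other fields avoids treating a failed response as an empty document.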