Comprehensive guide to View's crawl filter management system, including content filtering, size constraints, directory traversal, and data ingestion configuration for efficient content discovery and processing.
Overview
The View Crawl Filter management system provides comprehensive configuration for content discovery and data ingestion filtering. Crawl filters serve as reusable templates that define what content from data repositories should be crawled, enabling precise control over data ingestion processes, content filtering, and resource optimization for efficient content discovery and processing.
Key Features
- Content Filtering: Advanced filtering based on content types, file extensions, and MIME types
- Size Constraints: Configurable minimum and maximum file size limits for crawling
- Directory Traversal: Control over subdirectory inclusion and hierarchical crawling
- Template Reusability: Reusable filter templates that can be referenced by multiple crawl plans
- Performance Optimization: Efficient filtering to reduce unnecessary data processing
- Flexible Configuration: Support for wildcard patterns and custom filtering criteria
- Resource Management: Optimized resource usage through intelligent content filtering
- Integration Support: Seamless integration with crawl plans and data ingestion workflows
Supported Operations
- Create: Create new crawl filter configurations with content and size constraints
- Read: Retrieve individual crawl filter configurations and metadata
- Enumerate: List all crawl filters with pagination support
- Update: Modify existing crawl filter configurations and settings
- Delete: Remove crawl filter configurations and associated templates
- Existence Check: Verify crawl filter presence without retrieving details
API Endpoints
Crawl filters are managed via the Crawler server API at [http|https]://[hostname]:[port]/v1.0/tenants/[tenant-guid]/crawlfilters
Supported HTTP Methods: GET, HEAD, PUT, DELETE
Important: All crawl filter operations require appropriate authentication tokens.
Crawl Filter Object Structure
Crawl filter objects contain comprehensive configuration for content discovery and data ingestion filtering. Here's the complete structure:
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
Field Descriptions
- GUID (GUID): Globally unique identifier for the crawl filter object
- TenantGUID (GUID): Globally unique identifier for the tenant
- Name (string): Display name for the crawl filter
- MinimumSize (integer): Minimum size of objects considered candidates for retrieval (in bytes)
- MaximumSize (integer): Maximum size of objects considered candidates for retrieval (in bytes)
- IncludeSubdirectories (boolean): Whether subdirectories should be crawled recursively
- ContentType (string): Content types that should be considered candidates for retrieval (use "*" for all types)
- CreatedUtc (datetime): UTC timestamp when the crawl filter was created
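The two size fields are raw byte counts, which can make large limits hard to read at a glance. Below is a small illustrative sketch in plain Python (not part of any SDK; the helper name is hypothetical) showing how familiar units map onto the documented default MaximumSize of 134217728 bytes:
# Helper for expressing crawl filter size limits in readable units.
# Illustrative only; the API simply expects integer byte counts.

def mib(n: int) -> int:
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024

# The documented default MaximumSize is 134217728 bytes, i.e. 128 MiB.
assert mib(128) == 134217728

# Example filter settings expressed with the helper:
filter_settings = {
    "Name": "My filter",
    "MinimumSize": 1,             # consider objects of at least 1 byte
    "MaximumSize": mib(128),      # skip anything larger than 128 MiB
    "IncludeSubdirectories": True,
    "ContentType": "*",
}
print(filter_settings)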
Important Notes
- Size Constraints: Use size limits to optimize crawling performance and resource usage
- Content Type Filtering: Use specific content types or "*" for all types based on your requirements
- Directory Traversal: Enable subdirectory crawling for comprehensive content discovery
- Template Reusability: Crawl filters can be referenced by multiple crawl plans for consistent filtering
Create Crawl Filter
Creates a new crawl filter configuration using PUT /v1.0/tenants/[tenant-guid]/crawlfilters. This endpoint allows you to configure content filtering, size constraints, and directory traversal settings for efficient data ingestion and content discovery.
Request Parameters
Required Parameters
- Name (string, Body, Required): Display name for the crawl filter
Optional Parameters
- MinimumSize (integer, Body, Optional): Minimum size of objects to crawl (in bytes, defaults to 1)
- MaximumSize (integer, Body, Optional): Maximum size of objects to crawl (in bytes, defaults to 134217728)
- IncludeSubdirectories (boolean, Body, Optional): Whether to crawl subdirectories recursively (defaults to true)
- ContentType (string, Body, Optional): Content types to include in crawling (defaults to "*" for all types)
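Because Name is the only required field and every other parameter has a documented default, a create request can in principle send just the display name. The following is a minimal sketch using plain HTTP (the requests library; the endpoint, tenant GUID, and token are placeholders, and the Authorization header format depends on your deployment's auth scheme):
import requests

BASE = "http://view.homedns.org:8000"            # placeholder endpoint
TENANT = "00000000-0000-0000-0000-000000000000"  # placeholder tenant GUID
TOKEN = "<token>"                                # supply your real token

# Only Name is required; MinimumSize, MaximumSize, IncludeSubdirectories,
# and ContentType fall back to their documented defaults.
resp = requests.put(
    f"{BASE}/v1.0/tenants/{TENANT}/crawlfilters",
    headers={"Authorization": TOKEN, "content-type": "application/json"},
    json={"Name": "My filter"},
)
resp.raise_for_status()
print(resp.json())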
Important Notes
- Size Optimization: Configure appropriate size limits to optimize crawling performance and resource usage
- Content Type Filtering: Use specific content types or "*" for all types based on your data ingestion requirements
- Directory Traversal: Enable subdirectory crawling for comprehensive content discovery
- Template Usage: Created filters can be referenced by multiple crawl plans for consistent filtering
curl --location --request PUT 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters' \
--header 'content-type: application/json' \
--header 'Authorization: ••••••' \
--data '{
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*"
}'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const createCrawlFilter = async () => {
try {
const response = await api.CrawlFilter.create({
Name: "My filter [ASH]",
MinimumSize: 1,
MaximumSize: 134217728,
IncludeSubdirectories: true,
ContentType: "*",
});
console.log(response, "Crawl filter created successfully");
} catch (err) {
console.log("Error creating Crawl filter:", err);
}
};
createCrawlFilter();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def createCrawlFilter():
    crawlFilter = crawler.CrawlFilter.create(
        Name="My filter",
        MinimumSize=1,
        MaximumSize=134217728,
        IncludeSubdirectories=True,
        ContentType="*"
    )
    print(crawlFilter)

createCrawlFilter()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
CrawlFilter filter = new CrawlFilter
{
    Name = "My filter",
    MinimumSize = 1,
    MaximumSize = 134217728,
    IncludeSubdirectories = true,
    ContentType = "*"
};
CrawlFilter createdFilter = await sdk.CrawlFilter.Create(filter);
Response
Returns the created crawl filter object with all configuration details:
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
Enumerate Crawl Filters
Retrieves a paginated list of all crawl filter objects in the tenant using GET /v2.0/tenants/[tenant-guid]/crawlfilters. This endpoint provides comprehensive enumeration with pagination support for managing multiple crawl filter configurations.
Request Parameters
No additional parameters required beyond authentication.
Response Structure
The enumeration response includes pagination metadata and crawl filter objects with complete configuration details:
{
"Success": true,
"Timestamp": {
"Start": "2024-10-21T02:36:37.677751Z",
"TotalMs": 23.58,
"Messages": {}
},
"MaxResults": 10,
"IterationsRequired": 1,
"EndOfResults": true,
"RecordsRemaining": 16,
"Objects": [
{
"GUID": "example-crawlfilter",
... crawlfilter details ...
},
{ ... }
],
"ContinuationToken": "[continuation-token]"
}
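The following is a brief sketch of consuming this enumeration envelope client-side using plain HTTP (the requests library; endpoint, tenant GUID, and token are placeholders, and the Authorization header format depends on your deployment). How the continuation token is passed on a subsequent request is not shown in this document, so the sketch only reports it:
import requests

BASE = "http://view.homedns.org:8000"            # placeholder endpoint
TENANT = "00000000-0000-0000-0000-000000000000"  # placeholder tenant GUID
TOKEN = "<token>"                                # supply your real token

resp = requests.get(
    f"{BASE}/v2.0/tenants/{TENANT}/crawlfilters/",
    headers={"Authorization": TOKEN},
)
resp.raise_for_status()
page = resp.json()

if page.get("Success"):
    for crawl_filter in page.get("Objects", []):
        print(crawl_filter["GUID"], crawl_filter.get("Name"))

    # EndOfResults and ContinuationToken drive paging; consult your SDK for
    # how to supply the token on the next request.
    if not page.get("EndOfResults"):
        print("More results available; continuation token:", page.get("ContinuationToken"))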
curl --location 'http://view.homedns.org:8000/v2.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters/' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const enumerateCrawlFilters = async () => {
try {
const response = await api.CrawlFilter.enumerate();
console.log(response, "Crawl filters fetched successfully");
} catch (err) {
console.log("Error fetching Crawl filters:", err);
}
};
enumerateCrawlFilters();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def enumerateCrawlFilters():
    crawlFilters = crawler.CrawlFilter.enumerate()
    print(crawlFilters)

enumerateCrawlFilters()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
EnumerationResult<CrawlFilter> response = await sdk.CrawlFilter.Enumerate();
Response
Returns a paginated list of crawl filter objects:
{
"Success": true,
"Timestamp": {
"Start": "2024-10-21T02:36:37.677751Z",
"TotalMs": 23.58,
"Messages": {}
},
"MaxResults": 10,
"IterationsRequired": 1,
"EndOfResults": true,
"RecordsRemaining": 0,
"Objects": [
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
],
"ContinuationToken": null
}
Read Crawl Filter
Retrieves crawl filter configuration and metadata by GUID using GET /v1.0/tenants/[tenant-guid]/crawlfilters/[crawlfilter-guid]. Returns the complete crawl filter configuration including content filtering, size constraints, and directory traversal settings. If the filter doesn't exist, a 404 error is returned.
Request Parameters
- crawlfilter-guid (string, Path, Required): GUID of the crawl filter object to retrieve
{
"GUID": "default",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"Prefix": "myprefix",
"Suffix": ".pptx",
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const readCrawlFilter = async () => {
try {
const response = await api.CrawlFilter.read(
"<crawlfilter-guid>"
);
console.log(response, "Crawl filter fetched successfully");
} catch (err) {
console.log("Error fetching Crawl filter:", err);
}
};
readCrawlFilter();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def readCrawlFilter():
    crawlFilter = crawler.CrawlFilter.retrieve("<crawlfilter-guid>")
    print(crawlFilter)

readCrawlFilter()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
CrawlFilter response = await sdk.CrawlFilter.Retrieve(Guid.Parse("<crawlfilter-guid>"));
Response
Returns the complete crawl filter configuration:
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
Note: the HEAD method can be used as an alternative to GET to simply check the existence of the object. HEAD requests return either a 200/OK in the event the object exists, or a 404/Not Found if not. No response body is returned with a HEAD request.
Read All Crawl Filters
Retrieves all crawl filter objects in the tenant using GET /v1.0/tenants/[tenant-guid]/crawlfilters/. Returns an array of crawl filter objects with complete configuration details for all filters in the tenant.
Request Parameters
No additional parameters required beyond authentication.
curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters/' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const readAllCrawlFilters = async () => {
try {
const response = await api.CrawlFilter.readAll();
console.log(response, "All crawl filters fetched successfully");
} catch (err) {
console.log("Error fetching All crawl filters:", err);
}
};
readAllCrawlFilters();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def readAllCrawlFilters():
    crawlFilters = crawler.CrawlFilter.retrieve_all()
    print(crawlFilters)

readAllCrawlFilters()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
List<CrawlFilter> response = await sdk.CrawlFilter.RetrieveMany();
Response
Returns an array of all crawl filter objects:
[
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
},
{
"GUID": "large-files-filter",
"TenantGUID": "default",
"Name": "Large files filter",
"MinimumSize": 1048576,
"MaximumSize": 1073741824,
"IncludeSubdirectories": true,
"ContentType": "application/pdf",
"CreatedUtc": "2024-07-11T10:15:30.123456Z"
}
]
Update Crawl Filter
Updates an existing crawl filter configuration using PUT /v1.0/tenants/[tenant-guid]/crawlfilters/[crawlfilter-guid]. This endpoint allows you to modify crawl filter parameters while preserving certain immutable fields.
Request Parameters
- crawlfilter-guid (string, Path, Required): GUID of the crawl filter object to update
Updateable Fields
All configuration parameters can be updated except for:
- GUID: Immutable identifier
- TenantGUID: Immutable tenant association
- CreatedUtc: Immutable creation timestamp
Important Notes
- Field Preservation: Certain fields cannot be modified and will be preserved across updates
- Complete Object: Provide a fully populated object in the request body
- Configuration Validation: All updated parameters will be validated before applying changes
- Performance Impact: Consider the impact of filter changes on existing crawl operations
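Because the endpoint expects a fully populated object, a common pattern is read-modify-write: fetch the current filter, change only the fields you need, and PUT the whole object back. The following is a minimal sketch using plain HTTP (the requests library; endpoint, tenant, filter GUID, and token are placeholders, the header format depends on your deployment, and the 256 MiB value is purely illustrative):
import requests

BASE = "http://view.homedns.org:8000"            # placeholder endpoint
TENANT = "00000000-0000-0000-0000-000000000000"  # placeholder tenant GUID
FILTER_GUID = "<crawlfilter-guid>"               # placeholder filter GUID
TOKEN = "<token>"                                # supply your real token
HEADERS = {"Authorization": TOKEN, "content-type": "application/json"}

url = f"{BASE}/v1.0/tenants/{TENANT}/crawlfilters/{FILTER_GUID}"

# 1. Read the current configuration.
current = requests.get(url, headers=HEADERS)
current.raise_for_status()
crawl_filter = current.json()

# 2. Modify only the mutable fields; GUID, TenantGUID, and CreatedUtc are preserved.
crawl_filter["Name"] = "My updated filter"
crawl_filter["MaximumSize"] = 268435456  # illustrative: raise the ceiling to 256 MiB

# 3. PUT the fully populated object back.
updated = requests.put(url, headers=HEADERS, json=crawl_filter)
updated.raise_for_status()
print(updated.json())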
Request body:
{
"GUID": "default",
"TenantGUID": "default",
"Name": "My updated filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*"
}
curl --location --request PUT 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters/00000000-0000-0000-0000-000000000000' \
--header 'content-type: application/json' \
--header 'Authorization: ••••••' \
--data '{
"Name": "My updated filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*"
}'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const updateCrawlFilter = async () => {
try {
const response = await api.CrawlFilter.update({
GUID: "<crawlfilter-guid>",
TenantGUID: "<tenant-guid>",
Name: "My filter [ASH] [UPDATED]",
MinimumSize: 1,
MaximumSize: 134217728,
IncludeSubdirectories: true,
ContentType: "*",
CreatedUtc: "2025-04-01T10:47:14.382138Z",
});
console.log(response, "Crawl filter updated successfully");
} catch (err) {
console.log("Error updating Crawl filter:", err);
}
};
updateCrawlFilter();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def updateCrawlFilter():
    crawlFilter = crawler.CrawlFilter.update(
        "<crawlfilter-guid>",
        Name="My filter [updated]",
        MinimumSize=1,
        MaximumSize=134217728,
        IncludeSubdirectories=True,
        ContentType="*"
    )
    print(crawlFilter)

updateCrawlFilter()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
CrawlFilter filter = new CrawlFilter
{
    GUID = "<crawlfilter-guid>",
    TenantGUID = "<tenant-guid>",
    Name = "My filter",
    MinimumSize = 1,
    MaximumSize = 134217728,
    IncludeSubdirectories = true,
    ContentType = "*"
};
CrawlFilter createdFilter = await sdk.CrawlFilter.Update(filter);
Response
Returns the updated crawl filter object with all configuration details:
{
"GUID": "defaultfilter",
"TenantGUID": "default",
"Name": "My updated filter",
"MinimumSize": 1,
"MaximumSize": 134217728,
"IncludeSubdirectories": true,
"ContentType": "*",
"CreatedUtc": "2024-07-10T05:21:00.000000Z"
}
Delete Crawl Filter
Deletes a crawl filter object by GUID using DELETE /v1.0/tenants/[tenant-guid]/crawlfilters/[crawlfilter-guid]. This operation permanently removes the crawl filter configuration from the system. Use with caution as this action cannot be undone.
Important Note: Ensure no active crawl plans are using this filter before deletion, as deleting a filter that is still referenced will break those crawl operations.
Request Parameters
- crawlfilter-guid (string, Path, Required): GUID of the crawl filter object to delete
curl --location --request DELETE 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters/00000000-0000-0000-0000-000000000000' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const deleteCrawlFilter = async () => {
try {
const response = await api.CrawlFilter.delete(
"<crawlfilter-guid>"
);
console.log(response, "Crawl filter deleted successfully");
} catch (err) {
console.log("Error deleting Crawl filter:", err);
}
};
deleteCrawlFilter();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def deleteCrawlFilter():
    crawlFilter = crawler.CrawlFilter.delete("<crawlfilter-guid>")
    print(crawlFilter)

deleteCrawlFilter()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
bool deleted = await sdk.CrawlFilter.Delete(Guid.Parse("<crawlfilter-guid>"));
Response
Returns 200/OK on successful deletion. No response body is returned.
Check Crawl Filter Existence
Verifies if a crawl filter object exists without retrieving its configuration using HEAD /v1.0/tenants/[tenant-guid]/crawlfilters/[crawlfilter-guid]. This is an efficient way to check filter presence before performing operations.
Request Parameters
- crawlfilter-guid (string, Path, Required): GUID of the crawl filter object to check
curl --location --head 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlfilters/00000000-0000-0000-0000-000000000000' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";
const api = new ViewCrawlerSdk(
"http://localhost:8000/", //endpoint
"default", //tenant Id
"default", //access key
);
const existsCrawlFilter = async () => {
try {
const response = await api.CrawlFilter.exists(
"<crawlfilter-guid>"
);
console.log(response, "Crawl filter exists");
} catch (err) {
console.log("Error checking Crawl filter:", err);
}
};
existsCrawlFilter();
import view_sdk
from view_sdk import crawler

sdk = view_sdk.configure(access_key="default", base_url="localhost", tenant_guid="<tenant-guid>")

def existsCrawlFilter():
    crawlFilter = crawler.CrawlFilter.exists("<crawlfilter-guid>")
    print(crawlFilter)

existsCrawlFilter()
using View.Sdk;
using View.Crawler;
ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"),
"default",
"http://view.homedns.org:8000/");
bool exists = await sdk.CrawlFilter.Exists(Guid.Parse("<crawlfilter-guid>"));
Response
- 200 OK: Crawl filter exists
- 404 Not Found: Crawl filter does not exist
- No response body: Only HTTP status code is returned
Note: HEAD requests do not return a response body, only the HTTP status code indicating whether the crawl filter exists.
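For completeness, the same existence check can be expressed as a bare HTTP call; the boolean is derived purely from the status code since HEAD returns no body. This is a sketch using plain Python (the requests library; endpoint, tenant, filter GUID, and token are placeholders, and the Authorization header format depends on your deployment):
import requests

BASE = "http://view.homedns.org:8000"            # placeholder endpoint
TENANT = "00000000-0000-0000-0000-000000000000"  # placeholder tenant GUID
FILTER_GUID = "<crawlfilter-guid>"               # placeholder filter GUID

resp = requests.head(
    f"{BASE}/v1.0/tenants/{TENANT}/crawlfilters/{FILTER_GUID}",
    headers={"Authorization": "<token>"},
)

# 200 means the filter exists, 404 means it does not; the body is always empty.
exists = resp.status_code == 200
print("Crawl filter exists:", exists)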
Best Practices
When managing crawl filters in the View platform, consider the following recommendations for optimal content discovery and data ingestion:
- Size Optimization: Configure appropriate size limits to balance comprehensive crawling with performance optimization
- Content Type Filtering: Use specific content types to focus crawling on relevant data and reduce processing overhead
- Directory Traversal: Enable subdirectory crawling for comprehensive content discovery while considering performance impact
- Template Reusability: Create reusable filter templates that can be shared across multiple crawl plans for consistency
- Performance Monitoring: Monitor crawl performance and adjust filter parameters based on actual crawling results
Next Steps
After successfully configuring crawl filters, you can:
- Crawl Plans: Create crawl plans that reference your configured filters for automated data ingestion
- Data Repositories: Set up data repositories and configure them to work with your crawl filters
- Crawl Jobs: Execute crawl jobs using your configured filters to begin content discovery and ingestion
- Performance Tuning: Monitor and optimize crawl performance based on filter effectiveness and resource usage
- Integration: Integrate crawl filters with other View platform services for comprehensive data processing workflows