Crawl Plans

Comprehensive guide to View's crawl plan management system, including automated data ingestion workflows, repository mapping, schedule integration, and processing configuration for efficient content discovery and data processing.

Overview

The View Crawl Plan management system provides comprehensive configuration for automated data ingestion workflows. Crawl plans serve as orchestration templates that map data repositories to crawl schedules and filters, defining the complete parameters for how data should be discovered, processed, and ingested into the View platform.

Key Features

  • Repository Mapping: Complete integration with data repositories for source data access
  • Schedule Integration: Automated execution based on configurable crawl schedules
  • Filter Application: Content filtering and size constraints through crawl filters
  • Processing Configuration: Integration with metadata and embeddings processing rules
  • Enumeration Management: Configurable enumeration storage and retention policies
  • Parallel Processing: Optimized parallel task execution for efficient data processing
  • Change Detection: Automated processing of additions, updates, and deletions
  • Workflow Orchestration: Complete automation of data ingestion and processing pipelines

Supported Operations

  • Create: Create new crawl plan configurations with repository and processing mappings
  • Read: Retrieve individual crawl plan configurations and metadata
  • Enumerate: List all crawl plans with pagination support
  • Update: Modify existing crawl plan configurations and settings
  • Delete: Remove crawl plan configurations and associated workflows
  • Existence Check: Verify crawl plan presence without retrieving details

API Endpoints

Crawl plans are managed via the Crawler server API at [http|https]://[hostname]:[port]/v1.0/tenants/[tenant-guid]/crawlplans

Supported HTTP Methods: GET, HEAD, PUT, DELETE

Important: All crawl plan operations require appropriate authentication tokens.
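
To make the URL pattern and authentication concrete, here is a minimal sketch in Python using the requests library. The hostname, port, tenant GUID, and access key below are placeholder assumptions; substitute the values for your deployment.

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "http://localhost:8601"   # [http|https]://[hostname]:[port]
TENANT_GUID = "00000000-0000-0000-0000-000000000000"
ACCESS_KEY = "default"

# Base endpoint for crawl plan operations.
url = f"{BASE_URL}/v1.0/tenants/{TENANT_GUID}/crawlplans"

# Every crawl plan operation carries an authentication token.
headers = {"Authorization": f"Bearer {ACCESS_KEY}"}

# List all crawl plans for the tenant.
response = requests.get(url, headers=headers)
print(response.status_code, response.json())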

Crawl Plan Object Structure

Crawl plan objects contain comprehensive configuration for automated data ingestion workflows. Here's the complete structure:

{
    "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
    "TenantGUID": "default",
    "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
    "CrawlScheduleGUID": "oneminute",
    "CrawlFilterGUID": "default",
    "MetadataRuleGUID": "example-metadata-rule",
    "EmbeddingsRuleGUID": "crawler-embeddings-rule",
    "Name": "Local files",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 16,
    "MaxDrainTasks": 4,
    "ProcessAdditions": true,
    "ProcessDeletions": true,
    "ProcessUpdates": true,
    "CreatedUtc": "2024-10-23T15:14:26.000000Z"
}

Field Descriptions

  • GUID (GUID): Globally unique identifier for the crawl plan object
  • TenantGUID (GUID): Globally unique identifier for the tenant
  • DataRepositoryGUID (GUID): Globally unique identifier for the data repository to crawl
  • CrawlScheduleGUID (GUID): Globally unique identifier for the crawl schedule
  • CrawlFilterGUID (GUID): Globally unique identifier for the crawl filter
  • MetadataRuleGUID (GUID): Globally unique identifier for the metadata processing rule
  • EmbeddingsRuleGUID (GUID): Globally unique identifier for the embeddings processing rule
  • Name (string): Display name for the crawl plan
  • EnumerationDirectory (string): Directory path for storing previous enumerations
  • EnumerationsToRetain (integer): Number of enumeration snapshots to retain
  • MaxDrainTasks (integer): Maximum number of parallel processing tasks
  • ProcessAdditions (boolean): Whether to process newly added files
  • ProcessDeletions (boolean): Whether to process deleted files
  • ProcessUpdates (boolean): Whether to process updated files
  • CreatedUtc (datetime): UTC timestamp when the crawl plan was created
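
For readers who model the payload in code, the following is a minimal, illustrative Python representation of the fields above. It is a local convenience sketch, not a class provided by the View SDK; GUID values are kept as strings for simplicity.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CrawlPlanModel:
    # Identifiers (GUIDs represented as strings for simplicity).
    guid: str
    tenant_guid: str
    data_repository_guid: str
    crawl_schedule_guid: str
    crawl_filter_guid: str
    metadata_rule_guid: str
    embeddings_rule_guid: str
    # Display name and enumeration storage.
    name: str
    enumeration_directory: str = "./enumerations/"
    enumerations_to_retain: int = 16
    # Parallelism and change-detection behavior.
    max_drain_tasks: int = 4
    process_additions: bool = True
    process_deletions: bool = True
    process_updates: bool = True
    created_utc: datetime = field(default_factory=lambda: datetime.now(timezone.utc))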

Important Notes

  • Workflow Orchestration: Crawl plans define complete data ingestion workflows
  • Repository Integration: Plans must reference valid data repositories for source data access
  • Schedule Dependencies: Plans require valid crawl schedules for automated execution
  • Processing Rules: Integration with metadata and embeddings rules enables advanced data processing

Create Crawl Plan

Creates a new crawl plan configuration using PUT /v1.0/tenants/[tenant-guid]/crawlplans. This endpoint allows you to define automated data ingestion workflows by mapping data repositories to crawl schedules and filters.

Request Parameters

Required Parameters

  • DataRepositoryGUID (string, Body, Required): GUID of the data repository to crawl
  • CrawlScheduleGUID (string, Body, Required): GUID of the crawl schedule for execution timing
  • CrawlFilterGUID (string, Body, Required): GUID of the crawl filter for content filtering
  • Name (string, Body, Required): Display name for the crawl plan

Optional Parameters

  • MetadataRuleGUID (string, Body, Optional): GUID of the metadata processing rule
  • EmbeddingsRuleGUID (string, Body, Optional): GUID of the embeddings processing rule
  • EnumerationDirectory (string, Body, Optional): Directory path for storing enumerations (defaults to "./enumerations/")
  • EnumerationsToRetain (integer, Body, Optional): Number of enumeration snapshots to retain (defaults to 16)
  • MaxDrainTasks (integer, Body, Optional): Maximum parallel processing tasks (defaults to 4)
  • ProcessAdditions (boolean, Body, Optional): Whether to process new files (defaults to true)
  • ProcessDeletions (boolean, Body, Optional): Whether to process deleted files (defaults to true)
  • ProcessUpdates (boolean, Body, Optional): Whether to process updated files (defaults to true)

Important Notes

  • Repository Dependencies: Ensure the data repository exists and is accessible before creating the plan
  • Schedule Configuration: Verify the crawl schedule is properly configured for your requirements
  • Filter Application: Configure appropriate crawl filters to optimize processing performance
  • Processing Rules: Set up metadata and embeddings rules for advanced data processing capabilities
curl -X PUT http://localhost:8601/v1.0/tenants/[tenant-guid]/crawlplans \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer [accesskey]" \
     -d '
{
    "DataRepositoryGUID": "e9068089-4c90-4ef7-b4bb-bafccb771a9c",
    "CrawlScheduleGUID": "default",
    "CrawlFilterGUID": "default",
    "MetadataRuleGUID": "example-metadata-rule",
    "EmbeddingsRuleGUID": "example-embeddings-rule",
    "Name": "My crawl plan",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 30,
    "MaxDrainTasks": 4,
    "ProcessAdditions": true,
    "ProcessDeletions": true,
    "ProcessUpdates": true
}'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const createCrawlPlan = async () => {
  try {
    const response = await api.CrawlPlan.create({
      DataRepositoryGUID: "<datarepository-guid>",
      CrawlScheduleGUID: "<crawlschedule-guid>",
      CrawlFilterGUID: "<crawlfilter-guid>",
      Name: "My crawl plan [ASH]",
      EnumerationDirectory: "./enumerations/",
      EnumerationsToRetain: 30,
      MetadataRuleGUID: "<metadatarule-guid>",
      ProcessingEndpoint:
        "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
      ProcessingAccessKey: "default",
      CleanupEndpoint:
        "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
      CleanupAccessKey: "default",
    });
    console.log(response, "Crawl plan created successfully");
  } catch (err) {
    console.log("Error creating Crawl plan:", err);
  }
};

createCrawlPlan();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)


def createCrawlPlan():
    crawlPlan = crawler.CrawlPlan.create(
        DataRepositoryGUID="00000000-0000-0000-0000-000000000000",
        CrawlScheduleGUID="00000000-0000-0000-0000-000000000000",
        CrawlFilterGUID="00000000-0000-0000-0000-000000000000",
        Name="My crawl plan",
        EnumerationDirectory="./enumerations/",
        EnumerationsToRetain=30,
        MetadataRuleGUID="00000000-0000-0000-0000-000000000000",
        EmbeddingsRuleGUID="00000000-0000-0000-0000-000000000000",
        MaxDrainTasks=10,
        ProcessAdditions=True,
        ProcessDeletions=True,
        ProcessUpdates=True,
        ProcessingEndpoint="http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
        ProcessingAccessKey="default",
        CleanupEndpoint="http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
        CleanupAccessKey="default",
    )
    print(crawlPlan)

createCrawlPlan()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");

CrawlPlan plan = new CrawlPlan
{
    DataRepositoryGUID = "<datarepository-guid>",
    CrawlScheduleGUID = "<crawlschedule-guid>",
    CrawlFilterGUID = "<crawlfilter-guid>",
    Name = "My crawl plan",
    EnumerationDirectory = "./enumerations/",
    EnumerationsToRetain = 30,
    MetadataRuleGUID = "<metadatarule-guid>",
    ProcessingEndpoint = "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
    ProcessingAccessKey = "default",
    CleanupEndpoint = "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
    CleanupAccessKey = "default"
};

CrawlPlan response = await sdk.CrawlPlan.Create(plan);

Response

Returns the created crawl plan object with all configuration details:

{
    "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
    "TenantGUID": "default",
    "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
    "CrawlScheduleGUID": "oneminute",
    "CrawlFilterGUID": "default",
    "MetadataRuleGUID": "example-metadata-rule",
    "EmbeddingsRuleGUID": "crawler-embeddings-rule",
    "Name": "My crawl plan",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 30,
    "MaxDrainTasks": 4,
    "ProcessAdditions": true,
    "ProcessDeletions": true,
    "ProcessUpdates": true,
    "CreatedUtc": "2024-10-23T15:14:26.000000Z"
}

Enumerate Crawl Plans

Retrieves a paginated list of all crawl plan objects in the tenant using GET /v2.0/tenants/[tenant-guid]/crawlplans. This endpoint provides comprehensive enumeration with pagination support for managing multiple crawl plan configurations.

Request Parameters

No additional parameters required beyond authentication.

curl --location 'http://view.homedns.org:8000/v2.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const enumerateCrawlPlans = async () => {
  try {
    const response = await api.CrawlPlan.enumerate();
    console.log(response, "Crawl plans fetched successfully");
  } catch (err) {
    console.log("Error fetching Crawl plans:", err);
  }
};

enumerateCrawlPlans();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def enumerateCrawlPlans():
    crawlPlans = crawler.CrawlPlan.enumerate()
    print(crawlPlans)

enumerateCrawlPlans()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");
EnumerationResult<CrawlPlan> response = await sdk.CrawlPlan.Enumerate();

Response Structure

The enumeration response includes pagination metadata and crawl plan objects with complete configuration details:

Response

Returns a paginated list of crawl plan objects:

{
    "Success": true,
    "Timestamp": {
        "Start": "2024-10-21T02:36:37.677751Z",
        "TotalMs": 23.58,
        "Messages": {}
    },
    "MaxResults": 10,
    "IterationsRequired": 1,
    "EndOfResults": true,
    "RecordsRemaining": 0,
    "Objects": [
        {
            "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
            "TenantGUID": "default",
            "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
            "CrawlScheduleGUID": "oneminute",
            "CrawlFilterGUID": "default",
            "MetadataRuleGUID": "example-metadata-rule",
            "EmbeddingsRuleGUID": "crawler-embeddings-rule",
            "Name": "Local files",
            "EnumerationDirectory": "./enumerations/",
            "EnumerationsToRetain": 16,
            "MaxDrainTasks": 4,
            "ProcessAdditions": true,
            "ProcessDeletions": true,
            "ProcessUpdates": true,
            "CreatedUtc": "2024-10-23T15:14:26.000000Z"
        }
    ],
    "ContinuationToken": null
}
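
As an illustration of how this envelope might be consumed, the sketch below (Python with the requests library; hostname, tenant GUID, and access key are placeholder assumptions) walks the Objects array and checks the pagination flags:

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "http://view.homedns.org:8000"
TENANT_GUID = "00000000-0000-0000-0000-000000000000"
ACCESS_KEY = "default"

url = f"{BASE_URL}/v2.0/tenants/{TENANT_GUID}/crawlplans/"
headers = {"Authorization": f"Bearer {ACCESS_KEY}"}

result = requests.get(url, headers=headers).json()

# Each entry in Objects is a full crawl plan record as documented above.
for plan in result.get("Objects", []):
    print(plan["GUID"], plan["Name"], plan["CrawlScheduleGUID"])

# EndOfResults and ContinuationToken indicate whether further pages exist.
if not result.get("EndOfResults", True):
    print("More results remain; continuation token:", result.get("ContinuationToken"))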

Read Crawl Plan

Retrieves crawl plan configuration and metadata by GUID using GET /v1.0/tenants/[tenant-guid]/crawlplans/[crawlplan-guid]. Returns the complete crawl plan configuration including repository mappings, schedule settings, and processing rules. If the plan doesn't exist, a 404 error is returned.

Request Parameters

  • crawlplan-guid (string, Path, Required): GUID of the crawl plan object to retrieve
curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/00000000-0000-0000-0000-000000000000' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const readCrawlPlan = async () => {
  try {
    const response = await api.CrawlPlan.read(
      "<crawlplan-guid>"
    );
    console.log(response, "Crawl plan fetched successfully");
  } catch (err) {
    console.log("Error fetching Crawl plan:", err);
  }
};

readCrawlPlan();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def readCrawlPlan():
    crawlPlan = crawler.CrawlPlan.retrieve("<crawlplan-guid>")
    print(crawlPlan)

readCrawlPlan()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");
CrawlPlan response = await sdk.CrawlPlan.Retrieve(Guid.Parse("<crawlPlan-guid>"));

Response

Returns the complete crawl plan configuration:

{
    "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
    "TenantGUID": "default",
    "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
    "CrawlScheduleGUID": "oneminute",
    "CrawlFilterGUID": "default",
    "MetadataRuleGUID": "example-metadata-rule",
    "EmbeddingsRuleGUID": "crawler-embeddings-rule",
    "Name": "Local files",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 16,
    "MaxDrainTasks": 4,
    "ProcessAdditions": true,
    "ProcessDeletions": true,
    "ProcessUpdates": true,
    "CreatedUtc": "2024-10-23T15:14:26.000000Z"
}

Note: the HEAD method can be used as an alternative to GET to simply check for the existence of the object. HEAD requests return either a 200/OK if the object exists or a 404/Not Found if it does not. No response body is returned with a HEAD request.

Read All Crawl Plans

Retrieves all crawl plan objects in the tenant using GET /v1.0/tenants/[tenant-guid]/crawlplans/. Returns an array of crawl plan objects with complete configuration details for all plans in the tenant.

Request Parameters

No additional parameters required beyond authentication.

curl --location 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const readAllCrawlPlans = async () => {
  try {
    const response = await api.CrawlPlan.readAll();
    console.log(response, "All crawl plans fetched successfully");
  } catch (err) {
    console.log("Error fetching All crawl plans:", err);
  }
};

readAllCrawlPlans();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def readAllCrawlPlans():
    crawlPlans = crawler.CrawlPlan.retrieve_all()
    print(crawlPlans)

readAllCrawlPlans()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");
List<CrawlPlan> response = await sdk.CrawlPlan.RetrieveMany();

Response

Returns an array of all crawl plan objects:

[
    {
        "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
        "TenantGUID": "default",
        "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
        "CrawlScheduleGUID": "oneminute",
        "CrawlFilterGUID": "default",
        "MetadataRuleGUID": "example-metadata-rule",
        "EmbeddingsRuleGUID": "crawler-embeddings-rule",
        "Name": "Local files",
        "EnumerationDirectory": "./enumerations/",
        "EnumerationsToRetain": 16,
        "MaxDrainTasks": 4,
        "ProcessAdditions": true,
        "ProcessDeletions": true,
        "ProcessUpdates": true,
        "CreatedUtc": "2024-10-23T15:14:26.000000Z"
    },
    {
        "GUID": "another-crawl-plan",
        "TenantGUID": "default",
        "DataRepositoryGUID": "another-repository-guid",
        "CrawlScheduleGUID": "hourly",
        "CrawlFilterGUID": "large-files-filter",
        "MetadataRuleGUID": "production-metadata-rule",
        "EmbeddingsRuleGUID": "production-embeddings-rule",
        "Name": "Production crawl plan",
        "EnumerationDirectory": "./enumerations/production/",
        "EnumerationsToRetain": 30,
        "MaxDrainTasks": 8,
        "ProcessAdditions": true,
        "ProcessDeletions": true,
        "ProcessUpdates": true,
        "CreatedUtc": "2024-10-24T10:30:15.123456Z"
    }
]

Update Crawl Plan

Updates an existing crawl plan configuration using PUT /v1.0/tenants/[tenant-guid]/crawlplans/[crawlplan-guid]. This endpoint allows you to modify crawl plan parameters while preserving certain immutable fields.

Request Parameters

  • crawlplan-guid (string, Path, Required): GUID of the crawl plan object to update

Updateable Fields

All configuration parameters can be updated except for:

  • GUID: Immutable identifier
  • TenantGUID: Immutable tenant association
  • CreatedUtc: Immutable creation timestamp

Important Notes

  • Field Preservation: Certain fields cannot be modified and will be preserved across updates
  • Complete Object: Provide a fully populated object in the request body
  • Configuration Validation: All updated parameters will be validated before applying changes
  • Workflow Impact: Consider the impact of plan changes on existing crawl operations
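
Because the endpoint expects a fully populated object, a common pattern is read-modify-write: retrieve the current plan, change only the fields you need, and submit the whole object back. Below is a minimal sketch against the raw REST endpoints documented on this page; the hostname, GUIDs, and access key are placeholder assumptions.

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "http://view.homedns.org:8000"
TENANT_GUID = "00000000-0000-0000-0000-000000000000"
PLAN_GUID = "00000000-0000-0000-0000-000000000000"
ACCESS_KEY = "default"

url = f"{BASE_URL}/v1.0/tenants/{TENANT_GUID}/crawlplans/{PLAN_GUID}"
headers = {"Authorization": f"Bearer {ACCESS_KEY}", "Content-Type": "application/json"}

# Read the current configuration so the request body is fully populated.
plan = requests.get(url, headers=headers).json()

# Change only what you need; immutable fields are preserved by the server.
plan["Name"] = "My updated crawl plan"
plan["EnumerationsToRetain"] = 30

# Submit the complete object back via PUT.
updated = requests.put(url, headers=headers, json=plan)
print(updated.status_code, updated.json())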

Request body:

curl --location --request PUT 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/00000000-0000-0000-0000-000000000000' \
--header 'content-type: application/json' \
--header 'Authorization: ••••••' \
--data '{
    "DataRepositoryGUID": "00000000-0000-0000-0000-000000000000",
    "CrawlScheduleGUID": "00000000-0000-0000-0000-000000000000",
    "CrawlFilterGUID": "00000000-0000-0000-0000-000000000000",
    "Name": "My updated crawl plan",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 30,
    "MetadataRuleGUID": "00000000-0000-0000-0000-000000000000",
    "ProcessingEndpoint": "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
    "ProcessingAccessKey": "default",
    "CleanupEndpoint": "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
    "CleanupAccessKey": "default"
}'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);


const updateCrawlPlan = async () => {
  try {
    const response = await api.CrawlPlan.update({
      GUID: "<crawlplan-guid>",
      TenantGUID: "<tenant-guid>",
      DataRepositoryGUID: "<datarepository-guid>",
      CrawlScheduleGUID: "<crawlschedule-guid>",
      CrawlFilterGUID: "<crawlfilter-guid>",
      MetadataRuleGUID: "<metadatarule-guid>",
      EmbeddingsRuleGUID: "<embeddingsrule-guid>",
      Name: "Traeger Recipe Forums [UPDATED]",
      EnumerationDirectory: "./enumerations/",
      EnumerationsToRetain: 16,
      MaxDrainTasks: 4,
      ProcessAdditions: true,
      ProcessDeletions: true,
      ProcessUpdates: true,
      CreatedUtc: "2025-03-25T21:50:09.230321Z",
    });
    console.log(response, "Crawl plan updated successfully");
  } catch (err) {
    console.log("Error updating Crawl plan:", err);
  }
};

updateCrawlPlan();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def updateCrawlPlan():
    crawlPlan = crawler.CrawlPlan.update(
        "<crawlplan-guid>",
        DataRepositoryGUID="<datarepository-guid>",
        CrawlScheduleGUID="<crawlschedule-guid>",
        CrawlFilterGUID="<crawlfilter-guid>",
        Name="My crawl plan [updated]",
        EnumerationDirectory="./enumerations/",
        EnumerationsToRetain=30,
        MetadataRuleGUID="<metadatarule-guid>",
        ProcessingEndpoint="http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
        ProcessingAccessKey="default",
        CleanupEndpoint="http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
        CleanupAccessKey="default"
    )
    print(crawlPlan)

updateCrawlPlan()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");

CrawlPlan plan = new CrawlPlan
{
    GUID = "<crawlplan-guid>",
    TenantGUID = "<tenant-guid>",
    DataRepositoryGUID = "<datarepository-guid>",
    CrawlScheduleGUID = "<crawlschedule-guid>",
    CrawlFilterGUID = "<crawlfilter-guid>",
    MetadataRuleGUID = "<metadatarule-guid>",
    EmbeddingsRuleGUID = "<embeddingsrule-guid>",
    Name = "My updated crawl plan",
    EnumerationDirectory = "./enumerations/",
    EnumerationsToRetain = 30,
    ProcessingEndpoint = "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing",
    ProcessingAccessKey = "default",
    CleanupEndpoint = "http://nginx-processor:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/processing/cleanup",
    CleanupAccessKey = "default"
};

CrawlPlan response = await sdk.CrawlPlan.Update(plan);

Response

Returns the updated crawl plan object with all configuration details:

{
    "GUID": "4292118d-3397-4090-88c6-90f1886a3e35",
    "TenantGUID": "default",
    "DataRepositoryGUID": "c854f5f2-68f6-44c4-813e-9c1dea51676a",
    "CrawlScheduleGUID": "oneminute",
    "CrawlFilterGUID": "default",
    "MetadataRuleGUID": "example-metadata-rule",
    "EmbeddingsRuleGUID": "crawler-embeddings-rule",
    "Name": "My updated local files",
    "EnumerationDirectory": "./enumerations/",
    "EnumerationsToRetain": 16,
    "MaxDrainTasks": 4,
    "ProcessAdditions": true,
    "ProcessDeletions": true,
    "ProcessUpdates": true,
    "CreatedUtc": "2024-10-23T15:14:26.000000Z"
}

Delete Crawl Plan

Deletes a crawl plan object by GUID using DELETE /v1.0/tenants/[tenant-guid]/crawlplans/[crawlplan-guid]. This operation permanently removes the crawl plan configuration from the system. Use with caution as this action cannot be undone.

Important Note: Ensure no active crawl operations are using this plan before deletion, as this will break ongoing crawl executions.

Request Parameters

  • crawlplan-guid (string, Path, Required): GUID of the crawl plan object to delete
curl --location --request DELETE 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/00000000-0000-0000-0000-000000000000' \
--header 'Authorization: ••••••' 
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);


const deleteCrawlPlan = async () => {
  try {
    const response = await api.CrawlPlan.delete(
      "<crawlplan-guid>"
    );
    console.log(response, "Crawl plan deleted successfully");
  } catch (err) {
    console.log("Error deleting Crawl plan:", err);
  }
};
deleteCrawlPlan();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def deleteCrawlPlan():
    crawlPlan = crawler.CrawlPlan.delete("<crawlplan-guid>")
    print(crawlPlan)

deleteCrawlPlan()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");
bool deleted = await sdk.CrawlPlan.Delete(Guid.Parse("<crawlPlan-guid>"));

Response

Returns 204 No Content on successful deletion. No response body is returned.

Check Crawl Plan Existence

Verifies if a crawl plan object exists without retrieving its configuration using HEAD /v1.0/tenants/[tenant-guid]/crawlplans/[crawlplan-guid]. This is an efficient way to check plan presence before performing operations.

Request Parameters

  • crawlplan-guid (string, Path, Required): GUID of the crawl plan object to check
curl --location --head 'http://view.homedns.org:8000/v1.0/tenants/00000000-0000-0000-0000-000000000000/crawlplans/00000000-0000-0000-0000-000000000000' \
--header 'Authorization: ••••••'
import { ViewCrawlerSdk } from "view-sdk";

const api = new ViewCrawlerSdk(
  "http://localhost:8000/" //endpoint
  "<tenant-guid>", //tenant Id
  "default" //access key
);

const existsCrawlPlan = async () => {
  try {
    const response = await api.CrawlPlan.exists(
      "<crawlplan-guid>"
    );
    console.log(response, "Crawl plan exists");
  } catch (err) {
    console.log("Error checking Crawl plan:", err);
  }
};

existsCrawlPlan();
import view_sdk
from view_sdk import crawler
from view_sdk.sdk_configuration import Service

sdk = view_sdk.configure(
    access_key="default",
    base_url="localhost", 
    tenant_guid="default",
    service_ports={Service.CRAWLER: 8000},
)

def existsCrawlPlan():
    crawlPlan = crawler.CrawlPlan.exists("<crawlplan-guid>")
    print(crawlPlan)

existsCrawlPlan()
using View.Sdk;
using View.Crawler;

ViewCrawlerSdk sdk = new ViewCrawlerSdk(Guid.Parse("00000000-0000-0000-0000-000000000000"), 
                                        "default", 
                                        "http://view.homedns.org:8000/");
bool exists = await sdk.CrawlPlan.Exists(Guid.Parse("<crawlPlan-guid>"));

Response

  • 200 OK: Crawl plan exists
  • 404 Not Found: Crawl plan does not exist
  • No response body: Only HTTP status code is returned

Note: HEAD requests do not return a response body, only the HTTP status code indicating whether the crawl plan exists.
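
Putting the existence check to use, the sketch below (Python with the requests library; placeholder hostname, GUIDs, and access key) verifies that a plan exists before issuing the DELETE described in the previous section:

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "http://view.homedns.org:8000"
TENANT_GUID = "00000000-0000-0000-0000-000000000000"
PLAN_GUID = "00000000-0000-0000-0000-000000000000"
ACCESS_KEY = "default"

url = f"{BASE_URL}/v1.0/tenants/{TENANT_GUID}/crawlplans/{PLAN_GUID}"
headers = {"Authorization": f"Bearer {ACCESS_KEY}"}

# HEAD returns only a status code: 200 if the plan exists, 404 if it does not.
exists = requests.head(url, headers=headers).status_code == 200

if exists:
    # DELETE permanently removes the plan and returns 204 No Content.
    status = requests.delete(url, headers=headers).status_code
    print("Deleted" if status == 204 else f"Delete failed with status {status}")
else:
    print("Crawl plan not found; nothing to delete.")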

Best Practices

When managing crawl plans in the View platform, consider the following recommendations for optimal data ingestion workflow configuration:

  • Repository Validation: Ensure data repositories are properly configured and accessible before creating crawl plans
  • Schedule Optimization: Configure crawl schedules based on data update frequency and processing requirements
  • Filter Configuration: Use appropriate crawl filters to optimize processing performance and reduce unnecessary data handling
  • Processing Rules: Set up metadata and embeddings rules for comprehensive data processing and analysis
  • Performance Tuning: Monitor and adjust parallel processing settings based on system resources and data volume

Next Steps

After successfully configuring crawl plans, you can:

  • Crawl Operations: Monitor crawl plan executions and track processing performance through crawl operations
  • Data Repositories: Set up additional data repositories to expand your data ingestion capabilities
  • Crawl Schedules: Create and configure crawl schedules to define automated execution timing
  • Crawl Filters: Develop specialized crawl filters for different content types and processing requirements
  • Integration: Integrate crawl plans with other View platform services for comprehensive data processing workflows