How to force the crawler to crawl the entire content source again?
Currently, the crawler runs based on the sitemap, so only the updated URLs are crawled. Is there any way to manually start the crawler so it crawls the entire content source all over again whenever needed? Or, is there any way to upload a 'manipulated' sitemap file so that the crawler crawls all the URLs listed in the sitemap?
Hi @... you can crawl all the data manually from the manual crawling option in the admin dashboard whenever needed. Frequency-based crawling will only crawl the updated URLs in the case of a sitemap with a lastmod tag.
The following is a detailed explanation:
In a website content source, we can have different types of inputs:
- File: in this type, we crawl URLs from a txt or an XML file.
- URL: this can be a website URL or the URL of a hosted sitemap, and the sitemap may carry a lastmod (last-modified timestamp) tag.
Only in the last case, i.e. a sitemap with a lastmod tag, do we crawl only the updated data, based on the lastmod tag. Hence, if a hosted sitemap with a lastmod tag is provided, the lastmod tag should be kept up to date. For all other inputs, we always crawl the whole data in every crawl.
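To make the lastmod behavior concrete, here is a minimal sketch (not the product's actual code) of how a crawler can select only the URLs whose lastmod is newer than the previous crawl. The sitemap contents, function name, and dates are all illustrative:

```python
# Illustrative sketch: filter sitemap URLs by <lastmod>, as a crawler might.
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/page-1</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/page-2</loc>
    <lastmod>2024-06-15</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_to_crawl(sitemap_xml: str, last_crawl: datetime) -> list[str]:
    """Return only the URLs whose <lastmod> is newer than the last crawl."""
    root = ET.fromstring(sitemap_xml)
    due = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            due.append(loc)  # no timestamp: crawl it to be safe
            continue
        modified = datetime.fromisoformat(lastmod).replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            due.append(loc)
    return due

print(urls_to_crawl(SITEMAP_XML, datetime(2024, 6, 1, tzinfo=timezone.utc)))
# -> ['https://example.com/docs/page-2']
```

This also explains the 'manipulated sitemap' idea from the question: bumping every lastmod value to the current time would make all URLs look updated. With the manual-crawl option available, though, that workaround shouldn't be necessary.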
Hi @... I think the issue is still valid for a website content source. I uploaded the URL txt file and manually started crawling, and during the crawler's runtime the content source became unsearchable until it was done.
Hi @...
As I have mentioned earlier, in website crawling we don't wipe the existing data during a manual crawl. We crawl the new data into a temporary location, and when the crawl is complete we swap the temporary data with the original data. So the existing data is available for search during the crawling period; it's just that the new data only becomes searchable once the crawl is complete.
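In case it helps, here is a minimal sketch of that crawl-then-swap pattern, assuming the crawled data lives in a directory on disk; the names and layout are hypothetical, not the product's actual implementation:

```python
# Illustrative crawl-then-swap: the live data keeps serving searches while
# the new crawl is staged, then two quick renames make the new data live.
import shutil
import tempfile
from pathlib import Path

def recrawl_and_swap(live_dir: Path, crawl_fn) -> None:
    """Crawl into a staging directory, then swap it in for the live data."""
    staging = Path(tempfile.mkdtemp(prefix="crawl-", dir=live_dir.parent))
    crawl_fn(staging)                # long-running; live_dir still serves search
    old_dir = live_dir.with_suffix(".old")
    live_dir.rename(old_dir)         # swap is just two near-instant renames
    staging.rename(live_dir)
    shutil.rmtree(old_dir)           # discard the previous crawl's data
```

The point of the pattern is that the expensive step (crawling) never touches the live data, so search only sees a brief switch-over at the end rather than an empty index during the crawl.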