Saturday, January 2, 2016

Troubleshoot Search Crawl Issues (Best Possible Ways)

To View the Crawl Log
·         Verify that the user account that is performing this procedure is an administrator for the Search service application.
·         In Central Administration, in the Quick Launch, click Application Management.
·         On the Application Management page, under Service Applications, click Manage service applications.
·         On the Service Applications page, in the list of service applications, click the Search service application that you want.
·         On the Search Administration page, in the Quick Launch, under Crawling, click Crawl Log.
·         On the Crawl Log – Content Source page, click the view that you want.
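If you prefer PowerShell, the same crawl log data can be queried from the SharePoint Management Shell using the CrawlLog class. This is only a minimal sketch: the URL filter is a placeholder, and the parameter order and filter values for GetCrawledUrls are given as I recall them, so verify them against the SDK before relying on them.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Get the Search service application (assumes a single SSA in the farm)
$ssa = Get-SPEnterpriseSearchServiceApplication

# CrawlLog exposes the same data the Crawl Log pages show in Central Administration
$crawlLog = New-Object Microsoft.Office.Server.Search.Administration.CrawlLog($ssa)

# GetCrawledUrls(getCountOnly, maxRows, urlQueryString, isLike,
#                contentSourceId, errorLevel, errorId, startDateTime, endDateTime)
# -1 generally means "all"; the exact filter values are an assumption - check the SDK
$crawlLog.GetCrawledUrls($false, 100, "http://myportal", $true, -1, -1, -1,
    [System.DateTime]::MinValue, [System.DateTime]::MaxValue) | Format-Table -AutoSize
```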

Fields Shown in the Crawl Log
The Content Source, Host Name, and Crawl History views show data in the following columns:
  • Successes. Items that were successfully crawled and searchable.
  • Warnings. Items that might not have been successfully crawled and might not be searchable.
  • Errors. Items that were not successfully crawled and might not be searchable.
  • Deletes. Items that were removed from the index and are no longer searchable.
  • Top Level Errors. Errors in top-level documents, including start addresses, virtual servers, and content databases. Every top-level error is counted as an error, but not all errors are counted as top-level errors. Because the Errors column includes the count from the Top Level Errors column, top-level errors are not counted again in the Host Name view.
  • Not Modified. Items that were not modified between crawls.
  • Security Update. Items whose security settings were crawled because they were modified.

Crawl Log Timer Job
By default, the data for each view in the crawl log is refreshed every five minutes by the timer job Crawl Log Report for Search Application <Search Service Application name>. You can change the refresh rate for this timer job, but in general, this setting should remain as is.
To check the status of the crawl log timer job
  1. Verify that the user account that is performing this procedure is a member of the Farm Administrators SharePoint group.
  2. In Central Administration, in the Monitoring section, click Check job status.
  3. On the Timer Job Status page, click Job History.
  4. On the Job History page, find Crawl Log Report for Search Application <Search Service Application name> for the Search service application that you want and review the status.
To change the refresh rate for the crawl log timer job
  1. Verify that the user account that is performing this procedure is a member of the Farm Administrators SharePoint group.
  2. In Central Administration, in the Monitoring section, click Check job status.
  3. On the Timer Job Status page, click Job History.
  4. On the Job History page, click Crawl Log Report for Search Application <Search Service Application name> for the Search service application that you want.
  5. On the Edit Timer Job page, in the Recurring Schedule section, change the timer job schedule to the interval that you want.
  6. Click OK.
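The same check and schedule change can also be scripted. Below is a minimal sketch, assuming the timer job's display name starts with "Crawl Log Report for Search Application" and that a 10-minute interval is acceptable in your farm (in general, leave the default as is).

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Find the crawl log report timer job for the Search service application
$job = Get-SPTimerJob | Where-Object { $_.DisplayName -like "Crawl Log Report for Search Application*" }

# Review its current schedule and last run time
$job | Select-Object DisplayName, Schedule, LastRunTime

# Change the refresh interval (example: every 10 minutes)
Set-SPTimerJob -Identity $job -Schedule "every 10 minutes between 0 and 59"
```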

Troubleshoot Crawl Problems:

First, verify that all servers in the farm are at the same patch level (the same cumulative updates and service packs on every server), and review the most recent upgrade error log in the ULS logs. If you find errors in the log file, resolve those issues and run PSConfig again.
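A quick way to compare patch levels from the SharePoint Management Shell is sketched below. The NeedsUpgrade check and the PSConfig switches shown are the ones I normally use, so treat them as an assumption and confirm them against your own upgrade documentation first.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Farm-wide build number (compare with the CU/service pack level you expect)
(Get-SPFarm).BuildVersion

# List the SharePoint servers and whether any of them still needs an upgrade pass
Get-SPServer | Where-Object { $_.Role -ne "Invalid" } |
    Select-Object Address, Role, NeedsUpgrade

# After fixing any errors found in the upgrade/ULS logs, re-run PSConfig, for example:
# PSConfig.exe -cmd upgrade -inplace b2b -wait -force
```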

This section provides information about common crawl log errors, crawler behavior, and actions to take to maintain a healthy crawling environment.

When an Item Is Deleted from the Index

When a crawler cannot find an item that exists in the index because the URL is obsolete or it cannot be accessed due to a network outage, the crawler reports an error for that item in that crawl. If this continues during the next three crawls, the item is deleted from the index. For file-share content sources, items are immediately deleted from the index when they are deleted from the file share.

“Object could not be found” Error for a File Share
This error can result from a crawled file-share content source that contains a valid host name but an invalid file name. For example, with a host name and file name of \\ValidHost\files\file1, \\ValidHost exists, but the file file1 does not. In this case, the crawler reports the error "Object could not be found" and deletes the item from the index. The Crawl History view shows:
  • Errors: 1
  • Deletes: 1
  • Top Level Errors: 1 (\\ValidHost\files\file1 shows as a top-level error because it is a start address)
The Content Source view shows:
  • Errors: 0
  • Deletes: 0
  • Top Level Errors: 0
The Content Source view will show all zeros because it only shows the status of items that are in the index, and this start address was not entered into the index. However, the Crawl History view shows all crawl transactions, whether or not they are entered into the index.

“Network path for item could not be resolved” Error for a File Share
This error can result from a crawled file-share content source that contains an invalid host name and an invalid file name. For example, with a host name and file name of \\InvalidHost\files\file1, neither \\InvalidHost nor the file file1 exists. In this case, the crawler reports the error "Network path for item could not be resolved" and does not delete the item from the index. The Crawl History view shows:
  • Errors: 1
  • Deletes: 0
  • Top Level Errors: 1 (\\InvalidHost\files\file1 shows as a top-level error because it is a start address)
The Content Source view shows:
  • Errors: 0
  • Deletes: 0
  • Top Level Errors: 0
The item is not deleted from the index, because the crawler cannot determine if the item really does not exist or if there is a network outage that prevents the item from being accessed.

Obsolete Start Address

The crawl log reports top-level errors for top-level documents, or start addresses. To ensure healthy content sources, you should take the following actions:
  • Always investigate non-zero top-level errors.
  • Always investigate top-level errors that appear consistently in the crawl log.
  • Otherwise, we recommend that you remove obsolete start addresses every two weeks after contacting the owner of the site.
To troubleshoot and delete obsolete start addresses
  1. Verify that the user account that is performing this procedure is an administrator for the Search service application.
  2. When you have determined that a start address might be obsolete, first determine whether it exists or not by pinging the site. If you receive a response, determine which of the following issues caused the problem:
    • If you can access the URL from a browser, the crawler could not crawl the start address because there were problems with the network connection.
    • If the URL is redirected from a browser, you should change the start address to be the same as the new address.
    • If the URL receives an error in a browser, try again at another time. If it still receives an error after multiple tries, contact the site owner to ensure that the site is available.
  3. If you do not receive a response from pinging the site, the site does not exist and should be deleted. Confirm this with the site owner before you delete the site.
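The ping and browser checks in the steps above can be scripted from the index server. Here is a minimal sketch using a hypothetical start address, http://myportal/sites/oldsite; the host name and URL are placeholders for your own.

```powershell
# Basic reachability of the host behind the start address
Test-Connection -ComputerName "myportal" -Count 2

# HTTP check; -MaximumRedirection 0 makes a redirected start address surface as an error
try {
    $response = Invoke-WebRequest -Uri "http://myportal/sites/oldsite" `
        -UseDefaultCredentials -UseBasicParsing -MaximumRedirection 0
    "Reachable, status code $($response.StatusCode)"
}
catch {
    "Request failed: $($_.Exception.Message)"
}
```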

Access Denied

When the crawl log continually reports an "Access Denied" error for a start address, the content access account might not have Read permissions to crawl the site. If you are able to view the URL with an administrative account, there might be a problem with how the permissions were updated. In this case, you should contact the site owner to request permissions. 
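One hedged way to confirm whether this is a permissions problem rather than a connectivity problem is to request the start address with the content access account's credentials. The account name and URL below are placeholders.

```powershell
# Prompt for the default content access (crawl) account - hypothetical account name
$crawlCred = Get-Credential "CONTOSO\svc_search_crawl"

# If this fails with 401/Access Denied but succeeds with an administrative account,
# the crawl account is missing Read permission on the web application or site
Invoke-WebRequest -Uri "http://myportal" -Credential $crawlCred -UseBasicParsing |
    Select-Object StatusCode, StatusDescription
```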

Numbers Set to Zero in Content Source View During Host Distribution
During a host distribution, the numbers in all columns in the Content Source view are set to zero. This happens because the numbers in the Content Source view are sourced directly from the crawl database tables. During a host distribution, the data from these tables is being moved, so the values remain at zero for the duration of the host distribution.
After the host distribution is complete, run an incremental crawl of the content sources to restore the original numbers.
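A short sketch for starting those incremental crawls once the host distribution completes (it assumes the content sources are idle and that there is a single Search service application):

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication

# Start an incremental crawl of every idle content source to repopulate the view
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Where-Object { $_.CrawlState -eq "Idle" } |
    ForEach-Object { $_.StartIncrementalCrawl() }
```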

File Share Deletes Shown as Errors in the Content Source View
When documents are deleted from a file-share content source that was successfully crawled, they are immediately deleted from the index during the next full or incremental crawl. These items will show as errors in the Content Source view of the crawl log, but will show as deletes in other views.

Stopping or restarting the SharePoint Server Search service causes crawl log transaction discrepancy

The SharePoint Server Search service (OSearch14) might be reset or restarted due to administrative operations or server functions. When this occurs, a discrepancy in the crawl history view of the crawl log can occur. You may notice a difference between the number of transactions reported per crawl and the actual number of transactions performed per crawl. This can occur because the OSearch14 service stores active transactions in memory and writes these transactions after they are completed. If the OSearch14 service is stopped, reset, or restarted before the in-memory transactions have been written to the crawl log database, the number of transactions per crawl will be shown incorrectly.

“Deleted by the gatherer (This item was deleted because its parent was deleted)”
The best possible ways to troubleshoot this issue are described below.

Multiple overlapping crawls cause results to be deleted
I have read in one article/blog that this error can occur when multiple crawls run at the same time and overlap, which may lead to a collision as results are written into the search index. If this happens too often, the content is removed as being unseen: there is a three-day limit on keeping results, and if the content is not found again within that time, it is removed. As an example, the content crawl and the people crawl may be running at the same time. I'm not convinced by this proposition: I had the system set up with just one incremental crawl running and the problem still occurred, and the system always waited for each incremental crawl to finish before starting another one (e.g. when there were just a few minutes between them in the schedule).
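To see whether crawls really are overlapping, you can list the content sources and their current crawl state. A quick sketch follows; the property names are the ones I recall from the ContentSource object, so verify them in your environment.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication

# If more than one source shows a crawling state at the same time, the schedules overlap
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, CrawlState, CrawlStarted, CrawlCompleted |
    Format-Table -AutoSize
```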
SharePoint crawler can’t access SharePoint website to crawl it
By design, the Local SharePoint sites crawl uses the default URL to access the SharePoint sites: the crawler (running from the index server) will attempt to visit the public/default URL. Some blogs/forums explain that "deleted by the gatherer" mostly occurs when the crawler is not able to access the SharePoint website at the time of crawling. In addition, I've seen it stated that the indexer visits the default view of a list in order to index that list.
There might be a number of reasons why the indexer can't reach a SharePoint URL:
·         Connectivity issues, DNS issues
You could get the issue if there are DNS problems when the indexer is trying to crawl, if the indexer cannot resolve links correctly, or if it cannot verify credentials in AD.

I've seen a couple of forums that suggest it comes down to an intermittent DNS resolution issue and has nothing to do with SharePoint configuration. Another user noticed an issue whereby the SharePoint crawler was not crawling any sites that had a DNS alias to itself; e.g., the server name was myserver01 and there was a DNS alias called myportal. The crawler would not crawl http://myportal or anything under it.
·         SharePoint server under heavy load
The problem may be caused when the server is under heavy load or when very large documents are being indexed, thus requiring more time for the indexer to access the site.

Check the index server can reach the URL

I've had inter-server communication issues before; when the problem occurs, make sure all the SharePoint servers can ping each other, and make sure the index server can reach the public URL (open IE on the index server and check it out). Alternatively, set up or write a monitoring tool on the index server that checks connectivity regularly and logs it until the issue appears.
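A very simple version of such a monitoring tool is sketched below. The URL and log path are placeholders; it just records a DNS resolution and an HTTP request every five minutes until you stop it.

```powershell
$url     = "http://myportal"                  # assumed public/default URL of the web application
$logFile = "C:\Temp\crawl-connectivity.log"   # hypothetical log location

while ($true) {
    $stamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
    try {
        # DNS check (Resolve-DnsName needs Windows Server 2012 or later; use nslookup otherwise)
        Resolve-DnsName -Name ([System.Uri]$url).Host -ErrorAction Stop | Out-Null

        # HTTP check using the server's own credentials
        $response = Invoke-WebRequest -Uri $url -UseDefaultCredentials -UseBasicParsing -TimeoutSec 30
        Add-Content -Path $logFile -Value "$stamp OK $($response.StatusCode)"
    }
    catch {
        Add-Content -Path $logFile -Value "$stamp FAIL $($_.Exception.Message)"
    }
    Start-Sleep -Seconds 300
}
```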
Increase time out values?
One forum suggested increasing your time-out values for search, since the problem may be caused by the indexer being too busy or having difficulty reaching the URL. If it fails to complete indexing of a document, for example, it will remove it from the gatherer process as incomplete. In SharePoint 2013, the setting in Central Administration can be found at:
Application Management > Manage Services on Server > SharePoint Server Search > Search Service Application, and then on the Farm Search Administration page there is a Time-Out (Seconds) setting which you can try to increase.

Normally these are set to 60 seconds. By changing them to 120 seconds or longer, you give the search service some extra time to connect to and index an item. I tried increasing this substantially but found it did not help.
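The same time-out values can also be set from PowerShell. The parameter names below are the ones I believe Set-SPEnterpriseSearchService uses for the two Farm Search Administration time-outs, so check them with Get-Help before running this.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Raise the connection and acknowledgement time-outs from the default 60 seconds to 120 seconds
# (parameter names are an assumption - verify with: Get-Help Set-SPEnterpriseSearchService -Detailed)
Set-SPEnterpriseSearchService -ConnectionTimeout 120 -AcknowledgementTimeout 120
```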
Check Alternate Access Mappings
Certain users have reported the problem happening as a result of incorrectly configured Alternate Access Mappings in Central Administration. If your default Alternate Access Mapping is not the same as the URL in your content source, you could have this issue. Check that the Alternate Access Mappings and the content source start address are the same.
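A quick way to list both sides for comparison is sketched below; nothing is changed here, it only displays the mappings and the start addresses so you can check they line up.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Alternate Access Mappings for every content web application
Get-SPWebApplication | ForEach-Object { Get-SPAlternateURL -WebApplication $_ }

# Start addresses defined on the content sources - these should match the Default zone URLs
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, StartAddresses
```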

Check that the site or list is configured to allow it to be in search results

For each site, check that "Allow this site to appear in search results?" is set to Yes in the Search and Offline Availability options. If it is a list or document library that is the problem, also check that the library is allowed to be indexed under the advanced settings of the library/list, and click the button to reindex the list on its next crawl.
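As far as I know, these two settings correspond to the NoCrawl flag on the web and list objects, so they can be checked from PowerShell with a sketch like this; the site URL and library title are placeholders.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$web = Get-SPWeb "http://myportal/sites/projects"   # hypothetical site URL

# $true means the site (or list) is excluded from search
$web.NoCrawl

$list = $web.Lists["Documents"]                     # hypothetical library title
$list.NoCrawl

# To allow indexing again:
# $web.NoCrawl = $false; $web.Update()
# $list.NoCrawl = $false; $list.Update()

$web.Dispose()
```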

Change the search schedule?

If there are issues with overlapping crawls (see "Multiple overlapping crawls" above), adjust the schedules, restart the search services, and see if the content is collected into the search index again.

DNS Loopback issue

Certain sites have recommended disabling the loopback check. Please note that I advise you to read thoroughly about disabling loopback checks on production SharePoint servers before you decide whether or not to do this; there are various pages of advice on the topic.
·         Open regedit
·         Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa
·         Under Lsa, create a DWORD value called DisableLoopbackCheck
·         In Value, type 1
·         Close regedit
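The same registry change can be made from an elevated PowerShell prompt (again, only after you have weighed the security implications for a production server):

```powershell
# Create (or overwrite) the DisableLoopbackCheck DWORD value and set it to 1
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa" `
    -Name "DisableLoopbackCheck" -Value 1 -PropertyType DWord -Force

# A reboot (or at least an IISRESET) is usually recommended for the change to take effect
```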

Recreate the content source

One suggested option is that creating a new content source and doing a full crawl fixed the issue; again, however, this is not ideal in a production environment. Alternatively, delete and re-create the content source and run a full crawl. I don't think you can reset the index for a specific content source unless you have a dedicated crawl just for that source, which of course the architecture permits.
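If you do decide to go this route, a sketch of the delete/re-create sequence looks like the following; the content source name and start address are placeholders for your own.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication

# Remove the existing content source (hypothetical name)
Remove-SPEnterpriseSearchCrawlContentSource -Identity "Local SharePoint sites" `
    -SearchApplication $ssa -Confirm:$false

# Re-create it with the same start address and kick off a full crawl
$cs = New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
    -Type SharePoint -Name "Local SharePoint sites" -StartAddresses "http://myportal"
$cs.StartFullCrawl()
```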

Create an include rule

Create an inclusion rule in Central Administration to force the inclusion of the site collection/site.
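Inclusion rules can also be added from PowerShell; a minimal sketch with a placeholder path:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$ssa = Get-SPEnterpriseSearchServiceApplication

# Force the crawler to include everything under this site collection (hypothetical URL)
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://myportal/sites/projects*" -Type InclusionRule
```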

Kerberos issues

If you've set up anything aside from Integrated Windows Authentication, you'll have to work harder to get your crawler working. Some issues are related to Kerberos. If you don't have the Infrastructure Update applied, SharePoint will not be able to use Kerberos authentication for web sites on non-default (80/443) ports.

Recommendation from a Microsoft person

I was contacted by another SharePoint user via this blog who got Microsoft to investigate their issue, and they came up with the following proposed steps to fix it:
The list of Alternate Access Mappings in Central Administration must match the bindings list in IIS exactly. There should be no extra bindings in IIS that don’t have AAMs.
1.     Remove the extra binding statement from IIS.
2.     IISRESET /noforce
3.     Go to central admin; Manage Service Applications; click your Search App – "Search Service Application 1", click "Crawl log" – notice the Deleted column.
4.     Click the link – the one in the Deleted column that lists the number of deleted items.
5.     Look for any rows that have the Deleted by the gatherer (This item was deleted because its parent was deleted) message and note the url.
6.     Navigate to that site and to site contents.
7.     Add a document library and a document into it. (This is necessary!)
8.     Retry search.
9.     Repeat steps for each site collection.
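To compare the IIS bindings against the AAM list (step 1 above), something like the following can help; it only lists both sides, and the actual clean-up of any extra bindings is still done in IIS Manager.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue
Import-Module WebAdministration

# All IIS bindings on this server
Get-WebBinding | Select-Object protocol, bindingInformation

# All Alternate Access Mappings in the farm - every IIS binding should map to one of these
Get-SPWebApplication | ForEach-Object { Get-SPAlternateURL -WebApplication $_ }
```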

