The issue mentioned in the heading occurs when a URL is blocked through the robots.txt file, yet Google still indexes it. Such URLs are marked as “Valid with warning” because Google is unsure whether it should index them or not.
The easiest way to re-check a URL’s status is through the Coverage section. Open the Indexed report to see whether the listed URLs are blocked by robots.txt. The crawl details show one of two outcomes: either robots.txt does not block the URL and the crawl is allowed, or robots.txt does block it and the report shows that fetching the page failed.
Once you have identified which URLs are being blocked, you can follow the points below to resolve the issue.
- When you find those URLs, export the list from Google Search Console and sort it alphabetically (a sketch after this list shows one way to check the exported list in bulk).
- Go through the ordered list and decide whether you want any of the URLs indexed. If you do, update robots.txt so that Google is allowed to access them.
- If you do not want a URL to appear in search results, you can leave robots.txt as it is, but double-check whether there are internal links pointing to it that you might want to remove.
- A third option is to let Google crawl the URL without indexing it. For this, remove the block from robots.txt so that Google can crawl the page and see its noindex directive.
- If you want to keep the content private and do not want anyone to access it, keep it in a staging environment.
- If you are not sure which part of robots.txt is blocking the URL, use the robots.txt tester. It shows in a new window which line is stopping Google from accessing the URL.
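If you exported the URL list as suggested above, a short script can check each URL against your live robots.txt rules. This is only a sketch: the robots.txt address, the file name urls.txt, and the Googlebot user-agent are assumptions you would adapt to your own site.

```python
# Minimal sketch: check an exported list of URLs against a site's
# robots.txt rules. "robots_url" and "urls.txt" are placeholders.
from urllib.robotparser import RobotFileParser

robots_url = "https://example.com/robots.txt"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetches and parses the live robots.txt

# One URL per line, e.g. the list exported from Search Console
with open("urls.txt") as f:
    urls = sorted(line.strip() for line in f if line.strip())

for url in urls:
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{'allowed' if allowed else 'BLOCKED':7s}  {url}")
```

Sorting the list first, as recommended above, makes it easier to spot whole directories that share the same blocking rule.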
Let us elaborate further on how to resolve the issue of blocked crawling. If you want to allow crawling but prevent indexing, use a noindex meta robots tag. Keep in mind that when crawling is blocked, Google can still index the URL based on links pointing to it, because it never gets to see the noindex tag. You should not use a noindex tag if the URL is canonicalized to a different page; instead, make sure the proper canonicalization signals are in place with a canonical tag on the page. That way, signals are passed correctly and consolidation works as intended.
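As a rough illustration, the two tags discussed above sit in the page’s head section and look like the snippet below; the URL is a placeholder. Use the noindex tag when the page should be crawled but kept out of the index, and a canonical tag (without noindex) when the page should consolidate to another URL.

```html
<!-- Option 1: allow crawling but keep the page out of the index -->
<meta name="robots" content="noindex">

<!-- Option 2: the page is a duplicate or variant; point signals at the
     preferred URL instead (do not combine with noindex on the same page) -->
<link rel="canonical" href="https://example.com/preferred-page/">
```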
The main reason Google does not crawl a URL is robots.txt, but there can be other causes, such as an intermittent block, an IP block, or a user-agent block. We have discussed above how to diagnose a robots.txt problem in Google Search Console. If you cannot access the console, you can view the file directly by appending /robots.txt to the domain name (for example, https://example.com/robots.txt).
Intermittent Block
If the problem seems to be resolved but then appears again, you are likely dealing with an intermittent block. In this case, the block is most likely caused by a disallow statement that you have to remove, and how you remove it depends on the technology you use.
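Purely as an illustration, a site-wide disallow rule in robots.txt looks like the snippet below; removing the Disallow line, or narrowing it to a specific path such as /private/, lifts the block. The paths shown are placeholders, not rules from any particular site.

```
# A site-wide block: every crawler is disallowed from every path
User-agent: *
Disallow: /
```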
If the issue affects the whole site and it runs on WordPress, you have most likely turned on the setting that discourages search engines. This usually happens when the website is new or has recently been migrated. The fix takes three simple steps: go to Settings, then Reading, and make sure the “Search Engine Visibility” option is unchecked.
User-agent Block
As the name suggests, this block targets a specific user-agent, such as Googlebot or AhrefsBot: the site detects a particular bot and blocks requests coming from that user-agent. You can spot the problem when the page loads fine in your browser but you get blocked as soon as you switch to the bot’s user-agent. In that case, the block is based on that specific user-agent.
You can use Chrome DevTools to set a specific user-agent, or change the user-agent with a browser extension.
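If you prefer to test from the command line, a small script can compare how the same page responds to a browser user-agent and a crawler user-agent. This is only a sketch: the URL is a placeholder, and the user-agent strings are examples you can swap for whichever bot you are investigating.

```python
# Minimal sketch: compare responses for a browser user-agent and a
# crawler user-agent. The URL and UA strings below are placeholders.
import requests

URL = "https://example.com/blocked-page/"

USER_AGENTS = {
    "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}

for name, ua in USER_AGENTS.items():
    resp = requests.get(URL, headers={"User-Agent": ua}, timeout=10)
    print(f"{name:10s} -> HTTP {resp.status_code}, {len(resp.content)} bytes")

# If the browser user-agent returns 200 but the crawler user-agent gets
# 403/503 (or a much smaller page), a user-agent block is likely.
```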
IP Blocks
If you have confirmed that neither robots.txt nor a user-agent block is responsible, the cause may be an IP block. This is comparatively harder to diagnose than the two blocks mentioned above. If you suspect it, contact your CDN or hosting provider; they can tell you the root cause of the block and how to resolve it.
We have discussed in detail what the “blocked by robots.txt” error is and how to resolve it, along with the other potential blocks that can prevent access to a URL.
A final thought on how this issue affects a site’s SEO: when it occurs, search engines have difficulty accessing your website’s content, and your SEO suffers as a result.