robots.txt is the first file a search engine looks for when it visits a site. When a spider arrives, it first checks whether robots.txt exists in the site's root directory; if it does, the spider determines its crawling scope from the file's contents. If the file does not exist, the spider crawls freely and may pick up duplicate paths or error pages. Sample robots.txt rules follow.
Block all search engines from accessing any part of the site, i.e. prevent every engine from indexing it:
User-agent: *
Disallow: /
Allow all search engines to access every link on the site:
User-agent: *
Allow: /
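The effect of these two records can be verified with Python's standard urllib.robotparser; a minimal sketch (the test path is made up):

```python
from urllib.robotparser import RobotFileParser

# Rules that block all crawlers from the whole site.
block_all = RobotFileParser()
block_all.parse(["User-agent: *", "Disallow: /"])

# Rules that allow all crawlers everywhere.
allow_all = RobotFileParser()
allow_all.parse(["User-agent: *", "Allow: /"])

print(block_all.can_fetch("googlebot", "/any/page.html"))  # False
print(allow_all.can_fetch("googlebot", "/any/page.html"))  # True
```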
Prevent specific directories from being indexed by search engines:
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Do not combine directories on one line, such as Disallow: /directory1/ /directory2/; each directory must get its own Disallow line.
Allow only specific search engines to access the site (here Baiduspider and Googlebot; the final record blocks all other engines):
User-agent: baiduspider
Allow: /
User-agent: googlebot
Allow: /
User-agent: *
Disallow: /
Block specific search engines from accessing the site (here Baiduspider and Googlebot):
User-agent: baiduspider
Disallow: /
User-agent: googlebot
Disallow: /
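The per-engine records above can be checked the same way with Python's standard urllib.robotparser; a minimal sketch (the test path is made up):

```python
from urllib.robotparser import RobotFileParser

# Block only Baiduspider; other crawlers get no matching record and
# fall back to the default, which is "allowed".
rules = RobotFileParser()
rules.parse(["User-agent: baiduspider", "Disallow: /"])

print(rules.can_fetch("baiduspider", "/index.html"))  # False
print(rules.can_fetch("googlebot", "/index.html"))    # True
```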
Block search engines from accessing all dynamic pages on the site (a dynamic page is any page whose URL contains a "?"):
User-agent: *
Disallow: /*?*
Allow search engines to access only pages with a particular file extension:
User-agent: *
Allow: /*.html$ (similarly .htm$, .php$, etc.)
Disallow: /
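Note that Python's urllib.robotparser follows the original robots exclusion draft and does not expand the `*` and `$` wildcards used in the last two examples. Google-style wildcard matching can be sketched with a small regex translation (rule_matches is a hypothetical helper, not a library function):

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style robots.txt pattern match: '*' matches any run of
    characters, and a trailing '$' anchors the match to the URL's end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/*?*", "/products?id=3"))        # True  (dynamic URL)
print(rule_matches("/*?*", "/products.html"))        # False
print(rule_matches("/*.html$", "/a/page.html"))      # True
print(rule_matches("/*.html$", "/a/page.html?x=1"))  # False
```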
Allow search engines to access pages only in specific directories:
User-agent: *
Allow: /directory1/directory2/ (allows access to pages in directory2)
Allow: /directory3/directory4/ (allows access to pages in directory4)
Disallow: /directory1/
Disallow: /directory3/
Block search engines from accessing files of a specific format on the site (files, mind you, not web pages):
User-agent: *
Disallow: /*.gif$ (similarly .jpg, .png, etc.)
Points to note when setting up robots.txt:
- robots.txt must be a plain-text file saved in .txt format.
- robots.txt must be placed in the website's root directory so that it is reachable at the top level (e.g. https://example.com/robots.txt).
- Write robots.txt exactly in the form shown in the examples above; directive names and paths are case-sensitive.
- If your site is fairly simple, the formats above are all you need. If it is large, with some areas that should be crawled and others that should not, files to block and files to allow, pages whose URLs contain "?", and so on, you will need to combine the formats above to write a robots.txt tailored to your site.
- A robots.txt placed in a subdirectory has no effect; crawlers only read the robots.txt in the root directory, so if the two differ, the root-directory file prevails.
- Use a robots.txt file only if your site contains content you do not want search engines to index. If you want search engines to index everything on the site, do not create a robots.txt file at all, not even an empty one. This point is often overlooked; an empty robots.txt file is actually unfriendly to search engines.
- This format does not just prevent pages from being crawled. More importantly, if your site has already been indexed and you then change robots.txt to the pattern below, your site will be removed from the search engines entirely:
User-agent: *
Disallow: /
Meta robots tags are optional for ordinary websites:
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> (do not index the page, and do not follow the links on it)
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW"> (index the page, and follow the links on it)
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW"> (index the page, but do not follow the links on it)
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> (do not index the page, but follow the links on it)
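These meta directives can be extracted programmatically; a minimal sketch using Python's standard html.parser (the sample HTML is invented):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # Directives are comma-separated, e.g. "noindex,follow".
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

parser = RobotsMetaParser()
parser.feed('<html><head><meta name="ROBOTS" content="NOINDEX,FOLLOW">'
            '</head><body></body></html>')
print(parser.directives)  # ['noindex', 'follow']
```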