For the last couple of weeks we’ve been looking at information architecture with a focus on making your site more usable for real people. Today I want to continue that discussion, this time focusing on how to make your site more usable for search engines.

If you remember, I mentioned that how we structure content can help people better understand what a site is about and help them find what they want more quickly and easily. The same is true for search engines.
Today I want to talk about the latter, about how search engines crawl and index your content and what you can do to make it easier for them. Next week we’ll continue with a discussion of siloing, or theming, which is a way to structure your content to help search engines better understand what it’s about and ideally help your content rank better for different keyword themes.
Depth of Site Structure and Search Engine Spiders
This weekend a friend and I went for a hike on one of the hundreds of hiking trails in the mountains where we live. There are a limited number of starting points for discovering those trails. You can start at one park and begin hiking one trail. When you see another trail branching off the one you’re on, you can take the new trail or continue on the same one. The further you walk, the more trails you’ll discover.
At some point you’ll likely get tired and head home. The next time you go for a hike you might start at the same park and on that same trail and begin walking and exploring again. If you do, you’ll encounter some of the same trails from your last walk as well as some new ones. Again you’ll tire at some point and go home, likely having found a few more trails than you knew about after your first visit.
There’s more than one park in town, more than one starting point for finding different trails. With each new starting point you’ll find new trails, and sometimes you’ll even cross old trails you discovered starting from other parks. Some trails in the mountains you’ll never find, but the more often you hike and the more places you start out from, the more you’ll find.
How Search Spiders Crawl and Index Your Content
Search spiders, or robots, find content on the web in much the same way we might discover new trails in the mountains, except that they follow links, and what they find at the end of those links might look very different than it did the last time they were there.
You and I will start out on a trail we know about and from there discover new trails connected to the one we’re on. Search engines will start out on a page they know about and discover new pages that are connected to the ones they visit.
In much the same way you and I discover more trails by varying where we start our hike, search engines find more pages by varying where they start. And just as you and I tire and stop hiking for the day, search engine spiders won’t continue crawling forever at any one time. There’s a limit to how deep they’ll crawl a site on any visit.
If we think about the above, we’re left with three ways a search engine might crawl and index more of your site:
- More entry points
- A deeper crawl
- Pages closer to the most common entry points
The first two above are a function of the links pointing into your site. We create more entry points by having other sites link to as many pages of our sites as possible. If they all link to the home page of our site we have one entry point. If they link to a variety of pages across our site we have many entry points.
More links flowing into your site generally means your site has more link equity. For Google that link equity is PageRank (PR), and Google has said that the more PR, and so the more link equity, a site has, the deeper it will crawl that site.
The last item above is where information architecture comes in. The closer we can place a page to one of the starting points of a crawl, the more likely that page will be found and consequently indexed. A shallower structure becomes a goal for increasing the number of pages indexed on your site.
We learned a few weeks ago about the principle of choices, the idea that the more options we provide, say in a menu, the harder it is for people to choose one of them. The principle of choices pushes us toward a deeper structure. It wants us to create top-level navigation with fewer links, since that’s easier for real people, and it pushes pages away from those starting-point pages.
Using Sitemaps To Speed Indexing
You’ve likely heard about sitemaps. Sitemaps offer a solution to the above problem of wanting a content structure that’s both deeper (for people) and shallower (for spiders). There are two kinds of sitemaps.
- HTML sitemaps are page(s) on your site that link to all the other pages on your site.
- XML sitemaps are files you submit to search engines to tell them about all the pages on your site.
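As a rough sketch, a minimal XML sitemap is little more than a list of URLs in an XML wrapper (the URLs here are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want search engines to know about -->
  <url>
    <loc>http://example.com/</loc>
  </url>
  <url>
    <loc>http://example.com/about/</loc>
  </url>
</urlset>
```

You submit the file to the search engines (or point to it with a Sitemap: line in your robots.txt) and they treat its entries as suggested URLs to crawl.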
An XML sitemap is really a backup plan. There’s no guarantee a search engine will crawl all the links in your XML sitemap. Think of it as a supplement to a good site structure. Google says as much on their About Sitemaps page.
“Google doesn’t guarantee that we’ll crawl or index all of your URLs. However, we use the data in your Sitemap to learn about your site’s structure, which will allow us to improve our crawler schedule and do a better job crawling your site in the future.”
I generally don’t use an XML sitemap and I’ve never had a problem with search engines discovering my pages. That’s not to say you shouldn’t create and submit an XML sitemap, but rather that if your site is structured well for crawling, an XML sitemap isn’t really necessary.
An HTML sitemap is one you create on your site. It’s a page like any other, one that links to all your other pages. If you place a link to your HTML sitemap on every page of your site, say in the footer, then the sitemap is never more than one click away from any page. And since the sitemap links to every other page, no page is ever more than two clicks away from any other page on your site.
As your site grows it becomes more difficult to link to every page from a single page, so your sitemap begins to have its own structure. Perhaps your top-level sitemap page links to several additional sitemaps (one for each section of your site) that then link directly to your pages. Each page is now never more than three clicks away from any other page.
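To sketch the idea (all file and page names here are hypothetical), the top-level sitemap links to one sub-sitemap per section, and each sub-sitemap links directly to that section’s pages:

```html
<!-- sitemap.html: the top-level sitemap, linked from every page's footer -->
<ul>
  <li><a href="/sitemap-articles.html">Articles sitemap</a></li>
  <li><a href="/sitemap-products.html">Products sitemap</a></li>
</ul>

<!-- sitemap-articles.html: links directly to every article -->
<ul>
  <li><a href="/articles/site-structure.html">Site Structure</a></li>
  <li><a href="/articles/duplicate-content.html">Duplicate Content</a></li>
</ul>
```

Any page is then footer → top sitemap → section sitemap → page: three clicks at most.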
In other words, your site structure can be as deep as you want; a sitemap you link to from every page still leaves a shallow path for search engines to find your content.
Before leaving sitemaps I want to mention video sitemaps. These are XML sitemaps specific to your video content. Because the content inside a video can’t be crawled, they’re probably more useful than regular XML sitemaps. The short video below will tell you what Google wants to see in your video sitemap. Again though, if you can include all those things directly on your site, you probably don’t need to submit a video sitemap.
A couple of times above I’ve mentioned that I don’t think XML sitemaps, including video sitemaps, are necessary. That doesn’t mean you shouldn’t use them. They certainly aren’t going to hurt, and they may very well help search engines find pages that are difficult to reach during a normal crawl.

What I want you to understand is that XML sitemaps are a supplement to, not a replacement for, a good site structure. It’s better to have search engines find your content by making it easy for them to reach each page following links on your site than to rely on them following the XML you submit.
Eliminate Duplicate Content

(Video: SEOmoz Whiteboard Friday – Dealing with Duplicate Content, from Scott Willoughby on Vimeo.)
Googlebot “works on a budget”: if you keep it busy crawling huge files or waiting for your page to load or following duplicate content URLs, you might be missing the chance to show it your other pages.
Duplicate content might be two completely different pages with the same content, or it might be the same page accessed through two different URLs. The latter happens a lot with content management systems, where there aren’t fixed pages of content but rather code that determines which content to pull from the database depending on different conditions.
It’s very possible for the same content to be reached through your main navigation, through a tag cloud, or through internal site search, with each route producing a different URL for the same page.
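For example, the three routes might produce URLs along these lines (hypothetical URLs for illustration):

```text
http://example.com/products/blue-widget/          (main navigation)
http://example.com/tag/widgets/?item=blue-widget  (tag cloud)
http://example.com/search/?q=blue+widget          (internal site search)
```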
The resulting page is the same in all three cases, but to a search engine it’s three different pages. You only want one of those URLs crawled and indexed. If all three are indexed you’re competing with yourself for the same traffic, which ends up leading to less traffic overall. You also leave it up to the search engines to decide which is the best page (URL) to show.
And if you allow search engines to crawl all 3 URLs it may take them longer to find the one you want while they’re crawling the one you don’t want.
There are a variety of solutions to the above.
- Meta information like noindex and nofollow
- Canonical tags
- 301 redirects
- Robots.txt to block crawling of certain pages
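As a quick, hypothetical sketch of the first two options, both live in the head of a page:

```html
<!-- Canonical link element on each duplicate URL, pointing at the one
     version of the page you want indexed -->
<link rel="canonical" href="http://example.com/products/blue-widget/" />

<!-- Meta robots on a page you don't want indexed; its links can still
     be followed -->
<meta name="robots" content="noindex, follow" />
```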
Each of the above is worthy of one or more posts on its own, so instead of trying to give you all the details here, I’ll offer some resources for more information below.
The main thing I want you to understand from this section is that you need to be aware of the structure of your site and how many different ways (URLs) there are to access the same content. Realize that while you and your visitors understand you’re looking at the same page, search engines don’t, and you need to help them understand a little more.
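A 301 redirect, for its part, might be set up on an Apache server in an .htaccess file like this (hypothetical paths):

```apache
# .htaccess: permanently move a duplicate URL to the canonical one.
# A 301 tells search engines the old URL has moved for good.
Redirect 301 /old-duplicate-page/ http://example.com/products/blue-widget/
```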
- Specify your canonical
- Learn about the Canonical Link Element in 5 minutes
- Learn More about the Canonical Link Element
- Google, Yahoo & Microsoft Unite On “Canonical Tag” To Reduce Duplicate Content Clutter
- Canonical URL Tag – The Most Important Advancement in SEO Practices Since Sitemaps
- Dispelling a Persistent Rel Canonical Myth
- Canonical URL’s for WordPress
- Canonical URL links
- When NOT To Use Canonical URL Links
- A Standard for Robot Exclusion
- Get yourself a smart robots.txt
- Learn more about robots.txt
- Robots.txt from SEOmoz Knowledgebase
- URL Rewrites & Redirects: The Gory Details (Part 1 of 2)
- URL Rewrites & Redirects: The Gory Details (Part 2 of 2)
- Guide to Applying 301 Redirects with Apache
- The anatomy of a server sided redirect: 301, 302 and 307 illuminated SEO wise
- Using htaccess Files for Pretty URLS
The way you structure your content plays a part in how well your content gets crawled and indexed. If you want a search engine to list one of your pages in their results, the search engine first needs to find that page. It’s important that we make it easier for spiders to find all of the pages we want indexed.
Fortunately, most of the ways you help search engines find your content also help real people find that same content. A sitemap, for example, can serve as a great backup to your main navigation and can be organized so that it works as a table of contents for your entire site. Shorter click paths mean people as well as spiders can get to your content more quickly.
Sometimes though, we need to understand the difference in how people and search engines see things. Real people won’t have any problem with multiple URLs pointing to the same content. If anything it likely makes it easier for them. Search engines on the other hand still get confused by “duplicate content” and you need to be aware of that so you can help make things clearer for them.
Next week we’ll look beyond crawling and indexing and talk about siloing or theming your content. The idea is to develop the structure of your content in a way to help reinforce the different keyword themes on your site and in the process help your pages rank better for keyword phrases around those themes.
Download a free sample from my book, Design Fundamentals.
Well, that’s in depth. I’m a fan of XML sitemaps myself – I know you say they aren’t necessary, but they do help.
Andy out of curiosity how have you noticed them helping? I’m not saying they don’t, just looking to understand more about how they might help. From what I’ve seen I would think as long as spiders can crawl and index your html you’re fine. Of course if xml sitemaps can help even more I’ll be happy to start using them more.
Thank you for sharing this detailed explanation of what Google is looking for. It’s very helpful to be given specific information. So does WordPress, with its pages and then archives, contribute to duplicate content as seen by the search engines?

I look forward to reading about theming content. Thanks
Thanks Ros. Yes, WordPress can output duplicate content. Many of the SEO plugins offer options to add noindex to some of the different post types to keep them from being indexed.
A good plugin is Joost’s robots meta plugin. He also has a great write up about WordPress SEO. The link will take you specifically to the part about archive pages. Scroll back to the top when you’re done with that part though. The whole article is a really good read.
Great info (as usual) Steve. Your point “xml sitemaps are a supplement, not a replacement for a good site structure” is right on the money! This post should be required reading for both web developers and clients planning on building a web site. An ounce of prevention?
Thanks Dave. Yeah, I’ve still never seen evidence that an XML sitemap will do anything extra if your site structure is in order. Not that they don’t have uses, but I think people take them for something more than they are.
This was a really helpful article. Thanks for the valuable info. I’m going to start using these immediately.
Site structure seems to be a bigger focus when you have an e-commerce style site than if you have a blog, for example. I think that’s largely because WordPress has decent site structure naturally, but I’m not entirely sure.
I’m not sure if the focus has to be any more in ecommerce sites, unless you mean it needs to be there since it’s not there by default. Site structure is definitely an important consideration in any site.
I think you’re right about WordPress. Assuming you set up categories well and perhaps tags too, the software does a good job structuring the site by default. It usually needs a few tweaks, but WordPress does a good job out of the box.
I’m seeing more and more Google+ author posts in the top 10 for lots of search phrases. I think Google is giving less relevant results higher priority simply because they use their Google+ page as an author page within the HTML. Although unfair to a certain degree, it does give hard workers who develop good content a leg up on their competition.
Are you signed into something Google when you notice this? If you’re signed into Google then there probably is a small boost to pages by Google+ authors you’re connected to.
It could also depend on what you’re searching. If it’s a topic where many would be tech and marketing savvy those people will have set up the necessary parts to have their author image display. The results themselves may not be changing so much, but those who ranked well were savvy enough to add the author information.
Really good information, thanks. What do you think about subdomains? In which cases should we use them, or should we try to avoid them?
As far as search engines are concerned I don’t think it matters much. I generally prefer to use subfolders, because I think it’s easier for real people to remember.