
SEO Experiment

Aref Shafaei edited this page Apr 6, 2022 · 38 revisions

Rough findings and thoughts (brain dump) as we add JSON-LD to Chaise pages with the intention of showing up in results of Google Dataset Search and Google Search.


Adding JSON-LD to Chaise

While adding JSON-LD to our pages, we ran into several questions and issues. The answers we found are collected below.

Why do some Google Dataset search results have multiple URLs?

A sample case is here, where we can see 3 URLs for one search result. This is because Google identified them as duplicates of the same dataset. All 3 pages have JSON-LD, and part of each page's JSON-LD references the other 2 pages. For example, the first page mentions the other pages inside the distribution attribute of its JSON-LD, which most likely helps Google identify them as duplicates. Google presumably also uses more sophisticated techniques to cluster them.

What is the significance of dateModified attribute in JSON-LD?

  • It does not influence the indexing process, as mentioned here, so it is safe to use RMT for our purposes.

  • It is not a required attribute, and many top-ranking data repositories do not add it at all (example here).

  • In some cases the page did not add any modified date to its JSON-LD (like here), but Google parsed the page and set the date on its own.

  • In many cases the date is picked up and shown as is in the search result (like here), and in other cases Google derives it on its own by parsing the page.

  • More details on how Google derives the date are here.

  • Google Dataset Search also has a filter on modified date to help users/researchers work with the latest datasets (example here).

  • Best practices to get the correct modified date in the results:

    (1) Show when a page has been updated

    (2) Use the right time zone

    (3) Be consistent in usage

    (4) Don’t use future dates or dates related to what a page is about

    (5) Follow Google’s structured data guidelines

    (6) Troubleshoot by minimizing other dates on the page, as Google will pay attention to the modified date if it's the only date/time on the page
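
Points (2) and (3) above can be sketched in Python. This is only a rough sketch and not how Chaise does it: it assumes the row's RMT value is available as an ISO 8601 string, the function name to_date_modified is hypothetical, and treating a value without an offset as UTC is an assumption that should be checked against the server.

```python
from datetime import datetime, timezone


def to_date_modified(rmt: str) -> str:
    """Return an ISO 8601 timestamp in UTC with an explicit offset.

    `rmt` is assumed to be an ISO 8601 string. A value without an offset
    is treated as UTC (an assumption, adjust if the server differs).
    """
    parsed = datetime.fromisoformat(rmt)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Always emit the same zone and precision so every page is consistent.
    return parsed.astimezone(timezone.utc).isoformat(timespec="seconds")


# Hypothetical RMT value with a -07:00 offset.
json_ld_fragment = {"dateModified": to_date_modified("2021-09-17T10:30:00-07:00")}
print(json_ld_fragment)
```

Emitting one zone and one precision everywhere keeps the JSON-LD date comparable across pages, whatever zone the source value was recorded in.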

What is the significance of isAccessibleForFree?

Google has a filter called free that lets users/researchers limit the results to freely available data. When this filter is applied, only the datasets with isAccessibleForFree set to true show up in the results.

Can we have multiple values for the schema.org attributes?

Yes, we can supply multiple values for all of the attributes. There is only one way to do this: supply an array of values instead of a single value. For example, for keywords you can add it as here https://search.google.com/test/rich-results?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset&id=e_Dxqr5DfkbJnzeL6Da26Q
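
A minimal sketch of the array form, written in Python for illustration. The record values and the helper name to_script_tag are made up; only the shape of keywords (a list rather than a single string) is the point.

```python
import json

# Hypothetical Dataset record; only the shape matters here.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example protocol",
    "description": "Illustrative description of the dataset.",
    "isAccessibleForFree": True,
    # Multiple values: a JSON array instead of a single string.
    "keywords": ["Protocol", "kidney", "genitourinary"],
}


def to_script_tag(data: dict) -> str:
    """Serialize a dict as a JSON-LD <script> element for the page head."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_script_tag(dataset))
```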

How can we check the validity of a generated JSON-LD?

Several tools are available to check this:

  1. Google Rich Results test
  2. Google Structured Data testing - marked as deprecated and replaced by 1)
  3. Schema.org validator - Only validates from a schema.org perspective, not taking Google's rules into consideration
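
Before pasting output into the tools above, a quick local pre-check can catch the obvious mistakes. The sketch below is not a replacement for any of them. The required keys and the description length bounds reflect our reading of Google's Dataset guidelines and should be treated as assumptions to verify; the function name precheck is hypothetical.

```python
import json

REQUIRED = ("name", "description")  # assumed from Google's Dataset guidelines
DESCRIPTION_LIMITS = (50, 5000)     # assumed character bounds, verify in the docs


def precheck(json_ld_text: str) -> list:
    """Return a list of problems; an empty list means the pre-check passed."""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    if data.get("@type") != "Dataset":
        problems.append("@type is not Dataset")
    for key in REQUIRED:
        if not data.get(key):
            problems.append(f"missing {key}")
    description = data.get("description")
    if isinstance(description, str):
        low, high = DESCRIPTION_LIMITS
        if not low <= len(description) <= high:
            problems.append("description length outside assumed limits")
    return problems


print(precheck('{"@type": "Dataset", "name": "x"}'))
```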

How does Google Dataset Search work under the hood (in short)?


Relevant article is here

Is Markdown or HTML supported in description?

Yes, see this example.

In the case of HTML, it is safest to stick to a limited set of tags such as anchor and paragraph.

Pending items:

  1. The name attribute of JSON-LD has the table name hard-coded in it as of now. Eventually it should come from templating, by adding the display name as a field there.
  2. The publisher attribute should get its value from PI after support for wait-for is added.
  3. keywords attribute inside Protocol:

       "keywords": [
         "Protocol",
         "kidney",
         "genitourinary",
         "{{{Title}}}",
         "{{{Keywords}}}", // eventually replace with pseudo-column from Protocol.Keyword
         "{{{Subjects}}}" // eventually replace with Protocol.Subject
       ],

Google Search Console API

There are two types of analytics that we can do,

The first one could be useful, but it requires more thought about the type of analysis that we want to do, and the second one is not really useful. I couldn't find any API that lets us get the index coverage report, and based on multiple sources that I saw (for example this website and this support page), Google doesn't offer one at all.

Experiment progress

The order of events is summarized below:

  • 08/06/21 the "chaise" group of URLs was added to the chaise-sitemap.xml.

  • 09/17/21 we officially started the experiment by including all the URLs in the chaise-sitemap.xml.

  • 09/25/21 Google reported that the new sitemap was read in their search console.

  • 10/04/21 we noticed that the list of Collection RIDs that we used for SSR was wrong, and all of the RIDs were actually in both the Chaise and Dataset groups as well. So we decided to generate the correct list, update the sitemap, and remove the wrong collections from production.

    At this point Google had already crawled the static version of some of the following RIDs: 16-QQG2, 16-QP8M, 16-QQG2, 16-WK64, 16-WHS4. Searching for their titles in Google showed both the static and non-static versions. Given that we had already removed the static versions, we decided to wait and see whether Google would remove them from the results.

  • 11/02/21 we started to look into the Google search console report and captured the indexed and duplicate URLs. We also looked into ways to automate this with scripts, but Google doesn't offer APIs for its index coverage report.

  • 11/04/21 moved all the raw data and analysis to the original Google sheet, which can be accessed with this link.

  • 11/16/21 captured the data. I realized that the number of successful/duplicate URLs keeps dropping. For example, this Specimen used to be indexed but no longer is. So I captured the data to have more data points to compare.

  • 11/29/21 noticed a big drop in the number of indexed URLs and decided to capture the URLs.

  • 12/11/21 Google reported 3 static sites as "soft 404", which means that even though the response was 200 and a proper HTML response was returned, Google treats the page as a 404 because of its content. Google is doing the same for pages that throw errors (for example this url, where it will complain about the missing table name), which is desirable. After we asked Google to reevaluate the pages, it dropped one page from this list. But it's still reporting the following as soft 404:

  • 01/04/22 the number of duplicates dropped significantly (506 to 130), while the number of successfully indexed URLs increased (215 to 318). Looking at the data, most of the newly reported indexed URLs seem to be pages that used to be indexed and are also in the SSR category.

  • 02/02/22 captured the final set of results to conclude the experiment.

Experiment findings

  • The main SEO Experiment Google sheet: https://docs.google.com/spreadsheets/d/10lE-MXfgnCEfuyLrNkSVwlZTE7dtubo_jYFVjlXjTjc/edit#gid=958916140

  • Raw data exported from search console: https://docs.google.com/spreadsheets/d/1qjj0vfgmk74IjlmwsPuzM3XZmC-MEKBNvBKmIoudiNY/edit

  • A number of Chaise URLs are reported as duplicates even though they refer to different RIDs. We might want to see whether adding a canonical tag helps in this case. The canonical tag should point to the browser's location to ensure uniqueness.

  • Google uses its smartphone bot as the default crawler. Would that hurt us?

    • It seems like we don't have a choice in this matter. Google switched to a mobile-first crawler a while back, but that doesn't mean desktop sites won't be crawled. Based on my understanding, Google will first try the mobile bot and then use the desktop bot. So we should try to make our sites mobile-friendly for better/faster indexing.
  • In Chaise the title of the page can have three different values:

    • Record | RBK/GUDMAP resources as soon as the HTML is fetched.
    • <table-name>: pending... | RBK/GUDMAP resources after we fetched the catalog/schema and waiting for the data to load.
    • <table-name>: <row-name> | RBK/GUDMAP resources after the data has been loaded.

    So depending on when Google actually captured the page, the title could be any of these values, and we can see this in the results. Is there any way to fix this? Maybe set the title only once? Or find best practices?

  • It seems like the Google crawler has a concept of soft 404, where it treats a response as a 404 even though it has a 200 status code. This is useful since it ignores URLs like https://gudmap.org/chaise/recordset, but it can have a downside. For example, we saw that it treats hatrac XML files as soft 404.

  • Looking at Google Dataset Search results, it seems that Google cannot properly handle our URLs. You can see that the "scholarly articles cite this dataset" link just points to the record app without any modifiers.

Future experiment (course of actions)

In this section, we summarize the plan of action for the future experiment.

Overall

  • Use a separate sitemap for each category to make the analysis simpler.
  • Don't pollute the sitemap with a lot of datasets. The more datasets, the longer it takes for Google to crawl/index them.
  • Limit the number of rows and make sure they are equal if our goal is to compare different types of datasets. If we have a lot of datasets for one type, the indexing budget will most probably be spent on those and nothing else.
  • We should have a good idea of how Google did after about 2-3 months; after that it will mostly just reindex. This is only a rough estimate, though, and depends on the number of URLs.
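
The first three points can be sketched as a small generator that writes one sitemap per category with the same cap on the number of URLs. This is a rough sketch only: the category names, example.org URLs, and output file names are hypothetical, and the real lists would come from the catalog.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, limit):
    """Return sitemap XML as a string, keeping at most `limit` URLs."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls[:limit]:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical categories and URLs, for illustration only.
urls_by_category = {
    "chaise": [f"https://example.org/chaise/record/{i}" for i in range(5)],
    "dataset": [f"https://example.org/id/{i}" for i in range(3)],
}

# Same cap for every category so the groups stay comparable.
ROWS_PER_CATEGORY = 3
sitemaps = {
    f"{name}-sitemap.xml": build_sitemap(urls, ROWS_PER_CATEGORY)
    for name, urls in urls_by_category.items()
}

for filename, xml_text in sitemaps.items():
    print(filename, xml_text.count("<loc>"))
```

Keeping one file per category means the search console's per-sitemap numbers can be read directly as per-category numbers.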

Chaise-config

  • Since resolver links redirect to a different page, Google ignores them while indexing. So we should not include any resolver link in the navbar.
  • Add "includeCanonicalTag": true.
  • chaise-config should not add pcid to links (not directly related to SEO)

Chaise

  • As described in the previous section, we might be able to improve the way the title is set in multiple steps.
  • The existing implementation of canonical tag uses the host that the URL is based on. So if it's GUDMAP, it will be GUDMAP, and if it's RBK, it will be RBK. This might not be desirable and we should think more about this.

Deriva webapps

Deriva-webapps showed up a lot in the reports, mainly because Google sees each instance of deriva-webapps with different query parameters as a completely different location. This can potentially harm our search metrics because of the limited crawl/index budget. Therefore we should add canonical tags to deriva-webapps and make sure Google ignores query parameters that don't change the state of the app. The canonical tag logic would most probably be different for each app:

Treeview

  • We should add a canonical tag (treeview can be invoked with query parameters, and by adding a canonical tag we ensure that https://www.gudmap.org/deriva-webapps/treeview is the canonical version and all the other URLs are just marked as duplicates).
  • When the app is working in query parameter mode, we should add a noindex meta tag. By doing this, we might be able to reduce the number of URLs that the Google crawler sees and therefore increase its performance.

Other apps

  • Add a canonical tag (to ignore the pcid and other unnecessary query params)
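
A minimal sketch of deriving the canonical URL by dropping query parameters that don't change what the page shows. It is written in Python for illustration only, since the apps themselves would do this in the browser. The pcid parameter comes from the notes above; the id parameter in the example and the function names are hypothetical.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed not to change the page state; extend per app as needed.
IGNORED_PARAMS = {"pcid"}


def canonical_url(url: str) -> str:
    """Return `url` without the ignored query parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def canonical_tag(url: str) -> str:
    """Return the <link rel="canonical"> element for `url`."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'


# "id" is a made-up parameter standing in for one that does change the page.
print(canonical_tag("https://www.gudmap.org/deriva-webapps/treeview?pcid=navbar&id=42"))
```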

Other

  • Make sure the URL paths that we don't want indexed are excluded, either through Disallow rules in robots.txt (which stop crawling) or through a noindex directive on the page or response. Google does not support a noindex rule inside robots.txt itself.

    • We should investigate whether there's a way to selectively ask hatrac to add the noindex directive, or whether there's any other way that Chaise can do that. For example, it makes sense for PDF hatrac files to be indexed, but a .bed file should not be indexed.
  • A lot of the URLs listed as duplicates were based on folder aliases that we have in GUDMAP/RBK (or at least they point to the same location). For example docs vs. gudmap-docs.

  • Using the "performance report" in Google search console we can get the list of best-performing user queries. Before starting the next experiment, we should take a look at that list and see whether any anatomy term or page appears there. If so, we should include the related Chaise pages in the sitemap so they show up in the results as well.
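
For the robots.txt item above, a small sketch of checking which URLs a crawler may fetch, using Python's standard urllib.robotparser. The rules shown are hypothetical examples built from paths mentioned in these notes, not our actual robots.txt. Note that this parser does plain prefix matching and does not implement Google's wildcard extensions, so it cannot express a rule such as "all .bed files".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /chaise/recordset
Disallow: /gudmap-docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


for url in (
    "https://gudmap.org/chaise/recordset",
    "https://gudmap.org/gudmap-docs/index.html",
    "https://gudmap.org/docs/index.html",
):
    print(url, is_crawlable(url))
```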
