SEO Experiment
Rough findings and thoughts (brain dump) as we add JSON-LD to Chaise pages with the intention of showing up in results of Google Dataset Search and Google Search.
- Adding JSON-LD to Chaise
- Experiment progress
- Experiment findings
- Future experiment (course of actions)
While we were adding JSON-LD to our pages, we ran into some questions and issues. All the useful information related to them is collected below.
In the sample case here, the search result lists 3 URLs because Google identified them as duplicates of the same dataset. All 3 pages have JSON-LD, and part of each page's JSON-LD references the other 2 pages (for example, the first page mentions the other pages inside the distribution tag of its JSON-LD), which most likely helps Google identify them as duplicates. Google of course has many more sophisticated techniques, apart from this, to cluster them.
-
It does not influence the indexing process as mentioned here and it is safe to use RMT for our intents and purposes.
-
It is not a required attribute and many top ranking data repositories do not add it at all like here
-
In some cases, the page did not add any date modified in the JSON-LD like here but Google parsed the page and set it on its own.
-
In many cases, the date is picked and shown as it is in the search result like here and in other cases, Google derives it on its own by parsing the page.
-
More details on how the date is derived by Google are here.
-
Google Dataset Search also has a filter on modified date to help users/researchers work with the latest datasets (example here).
-
Best practices to get the correct modified date in the results:
(1) Show when a page has been updated
(2) Use the right time zone
(3) Be consistent in usage
(4) Don’t use future dates or dates related to what a page is about
(5) Follow Google’s structured data guidelines
(6) Troubleshoot by minimizing other dates on the page, as Google will pay attention to the modified date if it's the only date/time on the page
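As a rough illustration of practices (2) and (4), the sketch below formats a modified timestamp for the dateModified attribute so that it always carries an explicit UTC offset and is rejected if it lies in the future. The helper name, the record name, and the timestamp are hypothetical, and treating naive timestamps as UTC is an assumption rather than something the notes above confirm.

```python
import json
from datetime import datetime, timezone


def date_modified_iso(last_updated: datetime) -> str:
    """Return an ISO 8601 timestamp that always carries a UTC offset."""
    if last_updated.tzinfo is None:
        # Assumption: naive timestamps coming from the database are in UTC.
        last_updated = last_updated.replace(tzinfo=timezone.utc)
    if last_updated > datetime.now(timezone.utc):
        # Best practice (4): never publish a future date.
        raise ValueError("dateModified must not be in the future")
    return last_updated.isoformat(timespec="seconds")


# Hypothetical record; the name and timestamp are made up for illustration.
rmt = datetime(2021, 9, 17, 14, 30, 0)
json_ld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example collection",
    "dateModified": date_modified_iso(rmt),
}
print(json.dumps(json_ld, indent=2))
```

Using one helper for every page keeps the format consistent, which is practice (3).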
What is the significance of isAccessibleForFree?
Google Dataset Search has a filter called free that allows users/researchers to limit themselves to freely available data. When the filter is on, only the datasets with isAccessibleForFree set to true show up in the results.
Yes, we can supply multiple values for all of the attributes. There is only one way to do this: supply an array of values instead of a single value. For example, for keywords you can add it as shown here: https://search.google.com/test/rich-results?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset&id=e_Dxqr5DfkbJnzeL6Da26Q
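A minimal sketch of the array form, assuming the JSON-LD is assembled as a plain dictionary before being embedded in the page. The function names, the dataset name, and the description are hypothetical; only the keywords and isAccessibleForFree attributes come from the notes above.

```python
import json


def as_list(value):
    """Wrap a single value in a list; pass lists and tuples through."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def build_dataset_json_ld(name, description, keywords, is_free=True):
    """Build a schema.org Dataset object with multi-valued keywords."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": as_list(keywords),  # always emitted as an array
        "isAccessibleForFree": is_free,
    }


# Hypothetical values, used only for illustration.
doc = build_dataset_json_ld(
    "Example protocol",
    "Hypothetical description used only for illustration.",
    ["Protocol", "kidney", "genitourinary"],
)
snippet = '<script type="application/ld+json">' + json.dumps(doc) + "</script>"
print(snippet)
```

The resulting snippet can be pasted into the Rich Results test to confirm that the array is accepted.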
There are various tools available to check this -
- Google Rich Results test
- Google Structured Data testing tool - marked as deprecated and replaced by the Rich Results test above
- Schema.org validator - Only validates from a schema.org perspective, not taking Google's rules into consideration
Relevant article is here
Yes, example is this
In case of HTML, it is safe to stick to limited tags like anchor and paragraph
- The name attribute of JSON-LD has the table name hard-coded in it as of now; eventually it should come from templating by adding display name as a field there.
- The publisher attribute should get its value from PI after support for wait-for is added.
- keywords attribute inside Protocol:
    "keywords": [
      "Protocol",
      "kidney",
      "genitourinary",
      "{{{Title}}}",
      "{{{Keywords}}}", // eventually replace with pseudo-column from Protocol.Keyword
      "{{{Subjects}}}" // eventually replace with Protocol.Subject
    ],
There are two types of analytics that we can do,
-
Figure out how the users are using the site. For this, we can use Google's "search analytics" APIs or Google Data Studio. THIS REQUIRES MORE EXPLORATION
-
Find basic information about the sitemap and URLs on the website:
https://developers.google.com/webmaster-tools/search-console-api-original/v3/quickstart/quickstart-python
https://developers.google.com/webmaster-tools/search-console-api-original/v3/libraries#python
The first one could be useful, but it requires more thought about the type of analysis that we want to do, and the second one is not really useful. I couldn't find any APIs that allow us to get the index coverage report, and based on multiple sources that I saw (for example this website and this support page) Google doesn't offer this at all.
In the following we summarized the order of events:
- 08/06/21: the "chaise" group of URLs was added to the chaise-sitemap.xml.
- 09/17/21: we officially started the experiment by including all the URLs in the chaise-sitemap.xml.
- 09/25/21: Google reported in their search console that the new sitemap was read.
- 10/04/21: we noticed that the list of Collection RIDs that we used for SSR was wrong, and all the RIDs were actually in both the Chaise and Dataset groups as well. So we decided to generate the correct list, update the sitemap, and remove the wrong collections from production. At this time Google had already crawled the static sites of some of the following RIDs: 16-QQG2, 16-QP8M, 16-QQG2, 16-WK64, 16-WHS4. Searching their titles in Google shows both static and non-static versions. Given that we already removed the static versions, we decided to wait and see whether Google will remove these from the results or not.
- 11/02/21: we started to look into the Google search console report and captured the indexed and duplicate URLs. We also looked into ways that we can automate this by writing scripts, but Google doesn't offer APIs for its index coverage report.
- 11/04/21: moved all the raw data and analysis to the original Google sheet page that can be accessed with this link.
  - It seems that as time goes by the number of indexed URLs does not necessarily increase. In this case we can see that 14 records that were marked as either indexed or marked, are now marked as not crawled/not indexed.
  - Even though some of the Dataset URLs are indexed, Google is not reporting any of the datasets in search console. On the other hand, inspecting them, or even searching in the dataset page, works properly and shows that Google has seen the dataset:
    - example of inspecting a URL: https://search.google.com/u/0/search-console/inspect?resource_id=https%3A%2F%2Fwww.gudmap.org%2F&id=5C4g6y_GPiaMEUWiEO7dVA
    - The following are records that show up in the dataset result:
      - Collection: 16-QKNG, 16-26EY, 16-WHS4, 16-E1WG
      - Study: 16-DW7R
    - But Google is reporting the following as indexed as well:
      - Collection: 17-3ZTY, Q-3K5E
      - Study: 16-DMRG, W-RAHW
    - Manually testing the URLs in Google's rich results test shows that sometimes Google is just seeing the not fully rendered Angular app, and that might be the reason why it has failed to see the dataset tag.
      - sometimes it can see the page correctly: https://search.google.com/test/rich-results/result?id=pBU3c6dbvrCnSoRB84DP3Q
      - sometimes it can't: https://search.google.com/test/rich-results/result?id=1Ie9clX1bOKKosc_vka4tw&url=https%3A%2F%2Fwww.gudmap.org%2Fchaise%2Frecord%2F%3F2%2FCommon%3ACollection%2FRID%3DQ-3K5E
  - I cannot explain this, but there's a chaise page in the successfully indexed list that is not in the sitemap: https://www.gudmap.org/chaise/record/?2/RNASeq:Experiment/RID=W-RB2W
- 11/16/21: captured the data. I realized that the number of successful/duplicate URLs is just going lower and lower. For example this Specimen used to be indexed but it's not anymore. So I captured the data to have more data points to compare.
- 11/29/21: noticed a big drop in the number of indexed URLs and decided to capture the URLs.
- 12/11/21: Google reported 3 static sites as "soft 404", which means that even though the response was 200 and a proper HTML response was returned, given its content Google treats it as a 404. Google is doing the same for pages that are throwing errors (for example this url where it will complain about the missing table name), which is desirable. After asking Google to reevaluate the pages, it dropped one page from this list. But it's still reporting the following as soft 404:
- 01/04/22: the number of duplicates has significantly reduced (506 to 130), while the number of successfully indexed URLs increased (215 to 318). Looking at the data, it seems like most of the newly reported indexed URLs are pages that used to be indexed and are also in the SSR category.
- 02/02/22: captured the final set of results to conclude the experiment.
-
The main SEO Experiment Google sheet: https://docs.google.com/spreadsheets/d/10lE-MXfgnCEfuyLrNkSVwlZTE7dtubo_jYFVjlXjTjc/edit#gid=958916140
-
Raw data exported from search console: https://docs.google.com/spreadsheets/d/1qjj0vfgmk74IjlmwsPuzM3XZmC-MEKBNvBKmIoudiNY/edit
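The capture-to-capture comparisons described in the timeline (URLs gaining or losing their indexed status between two dates) can be sketched roughly as below. The snapshot format, the status labels, and the example.org URLs are assumptions for illustration; real snapshots would be loaded from the exported sheets.

```python
def status_changes(before, after):
    """Compare two {url: status} snapshots and group URLs by transition."""
    changes = {}
    for url in sorted(set(before) | set(after)):
        old = before.get(url, "absent")
        new = after.get(url, "absent")
        if old != new:
            changes.setdefault((old, new), []).append(url)
    return changes


# Hypothetical captures; none of these URLs or statuses are real data.
first_capture = {
    "https://example.org/record/1": "indexed",
    "https://example.org/record/2": "duplicate",
    "https://example.org/record/3": "indexed",
}
second_capture = {
    "https://example.org/record/1": "not indexed",
    "https://example.org/record/2": "indexed",
    "https://example.org/record/3": "indexed",
}

for (old, new), urls in status_changes(first_capture, second_capture).items():
    print(f"{old} -> {new}: {len(urls)} URL(s)")
```

Grouping by transition rather than by final status makes it easier to spot pages that were indexed once and later dropped.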
- A bunch of URLs in chaise are reported as duplicate even though they refer to different RIDs. We might want to see whether adding a canonical tag helps in this case or not. The canonical tag should point to the browser's location to ensure uniqueness.
  - The canonical tag in static sites is working. For example the following are ignored in favor of https://gudmap.org that is defined in the canonical tag:
- Google is using the smartphone bot as the default crawler. Would that hurt us?
  - It seems like we don't have a choice in this matter. Google switched to a mobile-first crawler a while back, but that doesn't really mean that the desktop sites won't be crawled. Based on my understanding, Google will first try the mobilebot and then will use the desktopbot. So we should try to make our sites mobile-friendly for better/faster indexing.
- In Chaise the title of the page can have three different values:
  - Record | RBK/GUDMAP resources as soon as the HTML is fetched.
  - <table-name>: pending... | RBK/GUDMAP resources after we have fetched the catalog/schema and are waiting for the data to load.
  - <table-name>: <row-name> | RBK/GUDMAP resources after the data has been loaded.
  So depending on when Google has actually captured the page, the title could be any of these values, and we can actually see this in the results. Is there any way that we can fix this? Maybe only set the title one time? Or find best practices?
- It seems like the Google crawler has a concept of soft 404 where, even though the response has a 200 status code, it will treat it as 404. This is useful since it's ignoring URLs like https://gudmap.org/chaise/recordset, but it can have a downside. For example we saw that it's treating hatrac XML files as soft 404.
- Looking at Google dataset results, it seems like it cannot properly handle our URLs. You can see that the "scholarly articles cite this dataset" link is just pointing to the record app without any modifiers.
In this section, we summarize the plan of action for the future experiment.
- Separate sitemap for different categories to make the analysis simpler.
- Don't pollute the sitemap with a lot of datasets. The more datasets, the longer it takes for Google to crawl/index them.
- Limit the number of rows and make sure they are equal if our goal is to compare different types of datasets. If we have a lot of datasets for a type, the indexing budget will most probably be spent on those and nothing else.
- We should be able to have a good idea of how Google did in about 2-3 months, and after that it will just reindex. Although this is very subjective and depends on the number of urls in the page.
- Since resolver is redirecting to a different page, Google ignores them while indexing. So we should not include any resolver link in the navbar.
- Add "includeCanonicalTag": true.
- chaise-config should not add pcid to links (not directly related to SEO).
- As was described in the previous section, we might be able to improve the way we're showing title in multiple steps.
- The existing implementation of canonical tag uses the host that the URL is based on. So if it's GUDMAP, it will be GUDMAP, and if it's RBK, it will be RBK. This might not be desirable and we should think more about this.
Deriva-webapps showed up a lot in the reports. Mainly because Google is seeing each instance of deriva-webapps with different query parameters as a completely different location. This can potentially harm our search metrics because of the limited crawl/index budget. Therefore we should add canonical tags to deriva-webapps and make sure Google is ignoring query parameters that don't change the state of the app. The canonical tag logic would most probably be different for each app:
- We should add canonical tag (treeview can be invoked with query parameter and by adding a canonical tag we're ensuring that https://www.gudmap.org/deriva-webapps/treeview is the canonical version and all the other URLs should just be marked as duplicate).
- When the app is working in a query parameter mode, we should add noindex meta tag. By doing this, we might be able to reduce the URLs that Google crawler sees and therefore increase its performance.
- Add canonical tag (to ignore the pcid and other unnecessary query params)
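One way to derive such a canonical URL is to drop the ignored query parameters and keep everything else. This is only a sketch under assumptions: the function names and the example.org URL are hypothetical, pcid is the only parameter the notes name, and each app would need its own list of parameters that do not change its state.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Only pcid is named in the notes; extend per app as needed (assumption).
IGNORED_PARAMS = frozenset({"pcid"})


def canonical_url(url, ignored=IGNORED_PARAMS):
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def canonical_link_tag(url):
    """Render the <link rel="canonical"> element for a page URL."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


# Hypothetical app URL, used only for illustration.
print(canonical_link_tag("https://example.org/app/?pcid=navbar&id=42"))
```

For an app like treeview, where every query parameter should be ignored, the same idea applies with the query string dropped entirely.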
- Make sure the URL paths that we don't want to be crawled are disallowed in robots.txt. Note that Google does not support a noindex rule in robots.txt; noindex has to be served through a meta tag or the X-Robots-Tag response header.
  - We should investigate whether there's a way that we can selectively ask hatrac to add noindex (for non-HTML files this would have to be the X-Robots-Tag header rather than a meta tag), or whether there's any other way that Chaise can do that. For example, it makes sense for PDF hatrac files to be indexed, but a .bed file should not be indexed.
- A lot of the URLs listed as duplicate were based on folder aliases that we have in GUDMAP/RBK (or at least they point to the same location). For example docs vs. gudmap-docs.
- Using the "performance report" in Google search console we can get the list of best performing user queries. Before starting the next experiment we should take a look at that list and see whether any anatomy or page is in there; if so, we should include the related chaise pages in the sitemap so they show up in the results as well.
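The plan to keep a separate sitemap per category, with the same number of rows in each, could be sketched as follows. The file naming scheme, the category names, and the example.org URLs are hypothetical; the only fixed piece is the standard sitemap XML namespace.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Serialize a list of URLs as a sitemap document (UTF-8 bytes)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_category_sitemaps(urls_by_category, limit):
    """Build one sitemap per category, capped at `limit` URLs each."""
    return {
        f"sitemap-{category}.xml": build_sitemap(urls[:limit])
        for category, urls in urls_by_category.items()
    }


# Hypothetical categories and URLs, used only for illustration.
sitemaps = build_category_sitemaps(
    {
        "collection": [f"https://example.org/collection/{i}" for i in range(5)],
        "study": [f"https://example.org/study/{i}" for i in range(3)],
    },
    limit=2,
)
for filename, content in sitemaps.items():
    print(filename, len(content), "bytes")
```

Capping every category at the same limit keeps the groups comparable, so one large category does not consume the whole crawl budget.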