Karishma Pherwani edited this page May 24, 2021 · 45 revisions

Todos

Structured Data

This supports Google Dataset Search but may also help Google understand our pages in general; from https://developers.google.com/search/docs/guides/intro-structured-data:

Google uses structured data that it finds on the web to understand the content of the page, as well as to gather information about the web and the world in general.

I've (Laura) created a GitHub issue about adding an annotation that adds structured data to our record pages; there are some open questions, so I'd appreciate people reading and commenting on it. This will also enable Google Dataset Search for those pages.

Canonical Tags

This is what Google says about when to use canonical URLs:

  • To specify which URL that you want people to see in search results. You might prefer people reach your green dresses product page via https://www.example.com/dresses/green/greendress.html rather than https://example.com/dresses/cocktail?gclid=ABCD.
  • To consolidate link signals for similar or duplicate pages. It helps search engines to be able to consolidate the information they have for the individual URLs (such as links to them) into a single, preferred URL. This means that links from other sites to http://example.com/dresses/cocktail?gclid=ABCD get consolidated with links to https://www.example.com/dresses/green/greendress.html.
  • To simplify tracking metrics for a single product/topic. With a variety of URLs, it's more challenging to get consolidated metrics for a specific piece of content.
  • To manage syndicated content. If you syndicate your content for publication on other domains, you want to consolidate page ranking to your preferred URL.
  • To avoid spending crawling time on duplicate pages. You want Googlebot to get the most out of your site, so it's better for it to spend time crawling new (or updated) pages on your site, rather than crawling the desktop and mobile versions of the same pages.

Note from @lpearlman: it's all about not indexing the "duplicates". I do think that making a domain canonical is a good idea, but that can be done once per site in the Search Console. Or, actually, if we're already doing 301s, that will take care of it too.
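As a rough illustration of the first two points in the list above, the sketch below collapses URL variants onto one preferred form and renders the matching canonical tag. It is only a sketch under assumptions: the helper names, the `TRACKING_PARAMS` set, and the choice to force `https` and a single preferred host are all hypothetical, and it does not map one path to a different one (as in Google's green dress example), which would need a site-specific lookup.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: query parameters that only track the visit and never change the page.
TRACKING_PARAMS = {"gclid"}


def canonical_url(url, preferred_host):
    """Return the URL with https, the preferred host, and tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # The fragment is dropped because it never reaches the server.
    return urlunsplit(("https", preferred_host, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Render the <link> element that belongs in the page's <head>."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}">'


if __name__ == "__main__":
    variant = "http://example.com/dresses/cocktail?gclid=ABCD"
    print(canonical_link_tag(canonical_url(variant, "www.example.com")))
```

If 301 redirects already send every variant to the preferred URL, as the note above suggests, the tag is largely redundant.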

References

Making Datasets Discoverable

Why work on Google Dataset Search?

Google Dataset Search relies on crawlable structured data exposed via schema.org markup, using the schema.org Dataset class. It is a search engine over metadata. Making our pages crawlable by Google Dataset Search will not necessarily improve our results in Google Search, as the two are not entirely related. Dataset Search is mostly intended to simplify the research process for users. Results can be filtered by the type of dataset required, such as tables, images or text, or by whether the dataset is free to use.

https://www.mikegingerich.com/blog/how-google-dataset-search-can-benefit-your-marketing-plan/

The following links describe how to expose datasets and the structure the web crawler requires of the pages that describe them:

Links to refer

Current visibility -

Our pages that are already registered with DataCite and show up in DataCite Search (example - https://search.datacite.org/works?query=Prostate+epithelial-specific+expression+of+activated+PI3K) also show up in Google Dataset Search - https://datasetsearch.research.google.com/search?query=Prostate%20epithelial-specific%20expression%20of%20activated%20PI3K&docid=L2cvMTFqOWJfdGNzNg%3D%3D

It shows up under the name and banner of DataCite, though the DOI link routes to GUDMAP.

Google's rich results test shows the valid dataset from DataCite and highlights missing fields - https://search.google.com/test/rich-results?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset&id=u5nHSzxhLFCnjuJsGuHV5w

Findings of Google Dataset investigation -

• The benefit of incorporating JSON-LD ourselves is that it's more legible: anybody can read and crawl the metadata because it is embedded in the HTML, we can easily see the metadata generated for a page using the dev tools, and no calls to external services like DataCite are needed (no external dependency). Example - https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/

Our own pages would show up in the results, rather than DataCite's, as would be the case with the alternative below.

• The alternative, suggested in the following issue, is to use DataCite - https://github.com/informatics-isi-edu/chaise/issues/1801

As per the report below, portals like Kaggle and Figshare show up in results more often – “Pick popular portals to publish than personal ones if the publisher is not consistent. For instance, sites like Kaggle will show up in results more than a one time published portal. The report stated that data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are an excellent way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.”

Report - https://analyticsindiamag.com/google-dataset-search-analysis-report/

Paper - https://research.google/pubs/pub49385/

Which properties to include in the metadata?

Required properties - name, description https://developers.google.com/search/docs/data-types/dataset#dataset
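A minimal sketch of what the embedded markup could look like with only the two required properties, assuming the JSON-LD is generated server-side and written into the page's HTML. The function names and the description text are hypothetical placeholders; only `@context`, `@type`, `name` and `description` come from the schema.org/Google documentation linked above.

```python
import json


def build_dataset_jsonld(name, description, url=None):
    """Build a schema.org Dataset object with the two required properties."""
    data = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
    }
    if url:
        data["url"] = url  # optional; more optional properties can be added the same way
    return data


def to_script_tag(data):
    """Serialize the object into the <script> element crawlers look for."""
    # Escape "</" so a value containing "</script>" cannot end the element early;
    # "<\/" is still valid JSON and decodes back to "</".
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    record = build_dataset_jsonld(
        name="Prostate epithelial-specific expression of activated PI3K",
        description="Placeholder description of the record, for illustration only.",
    )
    print(to_script_tag(record))
```

The resulting tag can be pasted into the rich results test as a code snippet to see which recommended fields are still missing.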

Statistics on which properties are most commonly included in the metadata can be found in Google’s paper above.

List and detailed description of all properties - https://schema.org/Dataset

• Another way to inform search engines about our pages, suggested by a Google representative, is to list all our datasets in a sitemap (https://www.sitemaps.org). It’s not a mandatory requirement, but it allows crawlers to find our datasets more easily.

https://developers.google.com/search/docs/advanced/sitemaps/overview
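A sitemap in the sitemaps.org format is a small XML file with one `<url><loc>` entry per page. The sketch below builds one from a list of record URLs; the `build_sitemap` helper and the example URLs are hypothetical, and how the list of record URLs would be obtained is left open.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document listing each URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the text for us.
        ET.SubElement(entry, "loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical record URLs, for illustration only.
    print(build_sitemap([
        "https://www.example.org/record/1",
        "https://www.example.org/record/2",
    ]))
```

The output would be saved as a file such as `sitemap.xml` and submitted through the Search Console or referenced from `robots.txt`.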

How to test?

The way to test used to be the Structured Data Testing Tool, but it is now being deprecated, so it is not recommended - https://search.google.com/structured-data/testing-tool/u/0/?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset

The new alternative is the Rich Results Test - https://developers.google.com/search/docs/guides/prototype https://search.google.com/test/rich-results

Example result - https://search.google.com/test/rich-results?id=e2vghHBltBW3iGCZjYsqXw
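Before submitting a page to Google's tool, a quick local check can catch the most basic problem, a Dataset block that lacks a required property. The sketch below is not a substitute for the rich results test: it assumes the page's HTML is already available as a string, only looks at top-level JSON-LD objects, and the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser

# Required by Google for Dataset markup, per the documentation linked above.
REQUIRED = ("name", "description")


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def missing_dataset_properties(html):
    """Return missing required properties of the first Dataset block, or None if there is none."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "Dataset":
            return [prop for prop in REQUIRED if not block.get(prop)]
    return None


if __name__ == "__main__":
    page = (
        '<html><head><script type="application/ld+json">'
        '{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example"}'
        "</script></head><body></body></html>"
    )
    print(missing_dataset_properties(page))
```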
