Karishma Pherwani edited this page Jun 3, 2021 · 45 revisions

Todos

Structured Data

This supports Google Dataset Search but may also help Google understand our pages in general; from https://developers.google.com/search/docs/guides/intro-structured-data:

Google uses structured data that it finds on the web to understand the content of the page, as well as to gather information about the web and the world in general.

I (Laura) have created a GitHub issue about adding an annotation that adds structured data to our record pages. There are some open questions, so I'd appreciate people reading and commenting on it. This will also enable Google Dataset Search for those pages.

Canonical Tags

This is what Google says about when to use canonical URLs:

  • To specify which URL that you want people to see in search results. You might prefer people reach your green dresses product page via https://www.example.com/dresses/green/greendress.html rather than https://example.com/dresses/cocktail?gclid=ABCD.
  • To consolidate link signals for similar or duplicate pages. It helps search engines to be able to consolidate the information they have for the individual URLs (such as links to them) into a single, preferred URL. This means that links from other sites to http://example.com/dresses/cocktail?gclid=ABCD get consolidated with links to https://www.example.com/dresses/green/greendress.html.
  • To simplify tracking metrics for a single product/topic. With a variety of URLs, it's more challenging to get consolidated metrics for a specific piece of content.
  • To manage syndicated content. If you syndicate your content for publication on other domains, you want to consolidate page ranking to your preferred URL.
  • To avoid spending crawling time on duplicate pages. You want Googlebot to get the most out of your site, so it's better for it to spend time crawling new (or updated) pages on your site, rather than crawling the desktop and mobile versions of the same pages.

Note from @lpearlman: it's all about not indexing the "duplicates". I do think that making a domain canonical is a good idea, but that can be done once per site in the Search Console. Or, actually, if we're already doing 301s, that will take care of it too.
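
As a rough illustration of the first two points in the list above, here is a minimal Python sketch that strips tracking parameters from a URL, forces https and a preferred host, and renders the canonical link tag. The TRACKING_PARAMS set, the preferred_host default and both function names are assumptions made up for this example; nothing like this exists in our code yet. It does not map one page onto a different preferred page (as in the green dress example), it only normalizes a single URL.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking parameters to drop; adjust for the real site.
TRACKING_PARAMS = {"gclid", "utm_source", "utm_medium", "utm_campaign"}


def canonical_url(url, preferred_host="www.example.com"):
    """Return an https URL on the preferred host without tracking parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # The fragment is kept as-is, since some of our URLs carry the record
    # identity after the '#'.
    return urlunsplit(
        ("https", preferred_host, parts.path, urlencode(query), parts.fragment)
    )


def canonical_link_tag(url):
    """Render the <link rel="canonical"> element for the page head."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="{}">'.format(href)


print(canonical_link_tag("http://example.com/dresses/cocktail?gclid=ABCD"))
```

If we are already doing 301 redirects to the preferred host, as the note above says, the host part of this becomes redundant and only the parameter stripping would matter.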

References

Making Datasets Discoverable

Why work on Google Dataset Search?

Google Dataset Search relies on crawlable structured data exposed through schema.org markup, using the schema.org Dataset class. It is a search engine over metadata. Making our pages crawlable by Google Dataset Search will not necessarily improve our results in regular Google Search, as the two are not entirely related. Dataset Search is mostly intended to simplify the research process for users. Results can be filtered by the type of dataset required, such as tables, images or text, or by whether the dataset is free to use.

https://www.mikegingerich.com/blog/how-google-dataset-search-can-benefit-your-marketing-plan/

Unfortunately, usage stats for Google Dataset Search are not available, so we don't know who uses it or how many people use it every day.

The following links describe how to expose datasets and the structure that the web crawler requires for describing web pages and datasets:

Links to refer

Current visibility -

Our pages that are already registered with DataCite and show up in DataCite Search (example - https://search.datacite.org/works?query=Prostate+epithelial-specific+expression+of+activated+PI3K) also show up in Google Dataset Search - https://datasetsearch.research.google.com/search?query=Prostate%20epithelial-specific%20expression%20of%20activated%20PI3K&docid=L2cvMTFqOWJfdGNzNg%3D%3D

It shows up under the name and banner of DataCite, though the DOI link routes to GUDMAP.

Google's Rich Results Test shows the valid dataset from DataCite and highlights missing fields - https://search.google.com/test/rich-results?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset&id=u5nHSzxhLFCnjuJsGuHV5w

Findings of Google Dataset investigation -

• The benefit of incorporating JSON-LD ourselves is that it's more legible: anybody can read and crawl the metadata because it is embedded in the HTML, we can easily see the metadata generated for a page using the dev tools, and no calls to external services like DataCite are needed (no external dependency). Example - https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/


Our pages would show up in the results themselves, rather than DataCite showing up in the results, as would be the case with the alternative here -

• The alternative suggested here is to use DataCite - https://github.com/informatics-isi-edu/chaise/issues/1801

As per the report below, portals like Kaggle and Figshare show up in results more often: “Pick popular portals to publish than personal ones if the publisher is not consistent. For instance, sites like Kaggle will show up in results more than a one time published portal. The report stated that data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are an excellent way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.”

Report - https://analyticsindiamag.com/google-dataset-search-analysis-report/

Paper - https://research.google/pubs/pub49385/

Which properties to include in the metadata?

Required properties - name, description https://developers.google.com/search/docs/data-types/dataset#dataset
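
A quick way to catch a missing required property before a page goes out is a small check like the sketch below. It is only an illustration: the function name and the treatment of blank strings as missing are my assumptions, and the authoritative check is still Google's own tool.

```python
# Required properties for a Dataset, per the Google documentation linked above.
REQUIRED_PROPERTIES = ("name", "description")


def missing_required(metadata):
    """Return the required properties that are absent or blank in a JSON-LD dict."""
    missing = []
    for prop in REQUIRED_PROPERTIES:
        value = metadata.get(prop)
        # Assumption: a value that is not a non-empty string counts as missing.
        if not isinstance(value, str) or not value.strip():
            missing.append(prop)
    return missing


example = {"@type": "Dataset", "name": "Histology of Human and Mouse embryonic and fetal kidney"}
print(missing_required(example))
```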

[image: stats on which properties are most commonly included in the metadata, from Google’s paper above]

List and detailed description of all properties - https://schema.org/Dataset

Sample structure -

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "#DOI link",
  "name": "#A descriptive name of a dataset",
  "description": "#A short summary describing a dataset. MANDATORY ATTRIBUTES END HERE, REST ARE GOOD TO HAVE",

  "spatialCoverage": "#Only include this property if the dataset has a spatial dimension. For example, a single point where all the measurements were collected, or the coordinates of a bounding box for an area. It should be Text or Place",
  "temporalCoverage": "#The data in the dataset covers a specific time interval e.g. 1950-01-01/2013-12-18 or 2008",
  "identifier": "#Use the identifier property to attach any relevant Digital Object identifiers (DOIs) or Compact Identifiers. If the dataset has more than one identifier, repeat the identifier property or use an array.",
  "license": "#A license under which the dataset is distributed",
  "isAccessibleForFree": "#Boolean to indicate whether the dataset is accessible for free",
  "sourceOrganization": "#The Organization on whose behalf the creator was working",
  "dateCreated": "#The date on which the CreativeWork was created or the item was added to a DataFeed",
  "datePublished": "#Date of first broadcast/publication",
  "dateModified": "#Date of last update",

  "publisher": {
     "@type":"Organization",
     "name":"#Name of publisher"
  },
  "provider": {
     "@type":"Organization",
     "name":"#Name of provider"
  },
  "funder": {
     "@type":"Organization",
     "name":"#Name of funder"
  },

  "creator": {
    "@type": "Person",
    "sameAs": "#To uniquely identify individuals, use ORCID ID as the value",
    "givenName": "#First name",
    "familyName": "#Last name",
    "name": "#Name"
  },
  "keywords": "#Keywords summarizing the dataset",
  "url": "#Location of a page describing the dataset",
  "sameAs": "#The URL of a reference web page that unambiguously indicates the dataset's identity",
  "variableMeasured": "#The variable that this dataset measures. For example, temperature or pressure",
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "#Datasets are often published in repositories that contain many other datasets. The same dataset can be included in more than one such repository. You can refer to a data catalog that this dataset belongs to by referencing it directly - https://schema.org/DataCatalog"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "#The link for the download"
  }
}
</script>

Example - https://developers.google.com/search/docs/data-types/dataset#example
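
To show how a structure like the one above could be produced from code rather than written by hand, here is a minimal Python sketch that serializes a metadata dict into the script tag. The function name and the sample values are assumptions for illustration. Serializing with json.dumps guarantees valid JSON (no trailing commas or broken strings), and the "</" replacement is a precaution so a value can never close the script element early.

```python
import json


def jsonld_script_tag(metadata):
    """Serialize a metadata dict into a JSON-LD script element."""
    body = json.dumps(metadata, indent=2, ensure_ascii=False)
    # "\/" is a valid JSON escape for "/", so the parsed value is unchanged.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n{}\n</script>'.format(body)


# Illustrative values only; a real page would fill these from the record.
sample = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "A descriptive name of a dataset",
    "description": "A short summary describing a dataset",
}
print(jsonld_script_tag(sample))
```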

Sample JSON-LD

URL - https://www.rebuildingakidney.org/chaise/record/#2/Common:Collection/RID=Q-3K5C

DataCite-generated JSON-LD - https://api.datacite.org/dois/application/ld+json/10.25548/s5re-nvce

Google Dataset search result - https://datasetsearch.research.google.com/search?query=Histology%20of%20Human%20and%20Mouse%20embryonic%20and%20fetal%20kidney&docid=L2cvMTFqOWI3bDIwNQ%3D%3D
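
For comparing against what DataCite already publishes, a sketch like the following could pull the DataCite-generated JSON-LD for a DOI. The URL pattern is generalized from the single link above, which is an assumption on my part, and the function names are made up for this example. The fetch needs network access, so it is defined here but not called.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the pattern of the link above holds for any DOI.
DATACITE_JSONLD_BASE = "https://api.datacite.org/dois/application/ld+json/"


def datacite_jsonld_url(doi):
    """Build the DataCite JSON-LD URL for a DOI such as 10.25548/s5re-nvce."""
    return DATACITE_JSONLD_BASE + quote(doi, safe="/")


def fetch_datacite_jsonld(doi, timeout=10):
    """Download and parse the JSON-LD document (requires network access)."""
    with urlopen(datacite_jsonld_url(doi), timeout=timeout) as response:
        return json.load(response)


print(datacite_jsonld_url("10.25548/s5re-nvce"))
```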

Below is the JSON-LD, with each value shown as the potential variable to map from, followed by the actual value for this record:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "{{{Persistent_ID}}} - https://doi.org/10.25548/s5re-nvce",
  "name": "{{{Title}}} - Histology of Human and Mouse embryonic and fetal kidney",
  "description": "{{{Description}}} - A collection of histological sections of the developing human (CS13-week 23) and mouse (E11.5-P2) kidney. ",
  "keywords": "#Important attribute should have value like here https://api.datacite.org/dois/application/ld+json/10.25548/s5re-nvce, value and source pending",
  "url": "{{{window.location}}} - https://www.rebuildingakidney.org/chaise/record/#2/Common:Collection/RID=Q-3K5C",
  "datePublished": "{{{Release_Date}}} - 2017-09-07 15:01:21",
  "dateModified": "{{{RMT}}} - 2017-12-07 15:29:22",
  "variableMeasured":"??",

  "publisher": {
    "@type":"Organization",
    "name":"{{{keyValues.Consortium}}} - GUDMAP"
  },
  "provider": {
    "@type":"Organization",
    "name":"{{{keyValues.$fkey_Common_Collection_Data_Provider_fkey.values.Name}}} - University of Southern California"
  },
  "creator":{
    "@type":"Person",
    "name":"{{{keyValues.$fkey_Common_Collection_Principal_Investigator_fkey.values.Full_Name}}} - Andrew McMahon"
  },

  "license": "??",
  "isAccessibleForFree": true,
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "{{{window.location.hostname}}} - www.rebuildingakidney.org"
  }
}
</script>
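
The mapping above could be sketched in Python roughly as follows. This is not how chaise does it; the flat record dict, the simplified keys Data_Provider and Principal_Investigator (standing in for the foreign-key paths above) and the function names are all assumptions for illustration. Two choices worth noting: the timestamps are converted to ISO 8601 dates, which is the form schema.org expects for Date values, and properties whose source is still pending (keywords, license, variableMeasured) are left out instead of being emitted with placeholder values.

```python
from datetime import datetime


def to_iso_date(timestamp):
    """Convert 'YYYY-MM-DD HH:MM:SS' into an ISO 8601 date (YYYY-MM-DD)."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date().isoformat()


def organization(name):
    return {"@type": "Organization", "name": name}


def record_to_jsonld(record, page_url, hostname):
    """Map a record (hypothetical flat dict) onto schema.org Dataset properties."""
    metadata = {
        "@context": "http://schema.org",
        "@type": "Dataset",
        "@id": record.get("Persistent_ID"),
        "name": record.get("Title"),
        "description": record.get("Description"),
        "url": page_url,
        "isAccessibleForFree": True,
        "includedInDataCatalog": {"@type": "DataCatalog", "name": hostname},
    }
    if record.get("Release_Date"):
        metadata["datePublished"] = to_iso_date(record["Release_Date"])
    if record.get("RMT"):
        metadata["dateModified"] = to_iso_date(record["RMT"])
    if record.get("Consortium"):
        metadata["publisher"] = organization(record["Consortium"])
    if record.get("Data_Provider"):
        metadata["provider"] = organization(record["Data_Provider"])
    if record.get("Principal_Investigator"):
        metadata["creator"] = {"@type": "Person", "name": record["Principal_Investigator"]}
    # Drop anything that has no value yet.
    return {key: value for key, value in metadata.items() if value is not None}
```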

• Another way to inform search engines about our pages, suggested by a Google representative, is to use a sitemap (https://www.sitemaps.org) that lists all our datasets. It’s not mandatory, but it allows crawlers to find our datasets more easily.

https://developers.google.com/search/docs/advanced/sitemaps/overview
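
A sitemap in the sitemaps.org format is a small XML file, so generating one is straightforward. The sketch below uses only the Python standard library. The function name and the example.org URLs are placeholders I made up; where the real list of dataset URLs and modification dates would come from is left open.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Serialize the sitemap namespace as the default namespace (no prefix).
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs; lastmod may be None."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        if lastmod:
            ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs for illustration only.
print(build_sitemap([
    ("https://www.example.org/datasets/1", "2021-06-03"),
    ("https://www.example.org/datasets/2", None),
]))
```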

How to test?

Testing used to be done with the Structured Data Testing Tool, but it’s now being deprecated (so not recommended) - https://search.google.com/structured-data/testing-tool/u/0/?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset

The new alternative is the Rich Results Test - https://developers.google.com/search/docs/guides/prototype https://search.google.com/test/rich-results

Example result - https://search.google.com/test/rich-results?id=e2vghHBltBW3iGCZjYsqXw
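
Before pasting a page into Google's tool, a local sanity check can confirm that the JSON-LD is actually in the HTML and parses. The sketch below is one way to do that with the standard library; the class and function names are assumptions, and it presumes the HTML passed in has already been rendered, since markup injected by JavaScript will not be in the raw page source. It is not a substitute for the Rich Results Test.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # json.loads raises ValueError on invalid JSON, which is itself
            # a useful signal that the markup is broken.
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_datasets(html):
    """Return every JSON-LD object in the HTML whose @type is Dataset."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [
        block for block in parser.blocks
        if isinstance(block, dict) and block.get("@type") == "Dataset"
    ]
```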
