
Todos

Structured Data

This supports Google Dataset Search but may also help Google understand our pages in general; from https://developers.google.com/search/docs/guides/intro-structured-data:

Google uses structured data that it finds on the web to understand the content of the page, as well as to gather information about the web and the world in general.

I (Laura) have created a GitHub issue about adding an annotation to add structured data to our record pages; there are some open questions, so I'd appreciate people reading and commenting on it. This will also enable Google Dataset Search for those pages.

Canonical Tags

This is what Google says about when to use canonical URLs:

  • To specify which URL that you want people to see in search results. You might prefer people reach your green dresses product page via https://www.example.com/dresses/green/greendress.html rather than https://example.com/dresses/cocktail?gclid=ABCD.
  • To consolidate link signals for similar or duplicate pages. It helps search engines to be able to consolidate the information they have for the individual URLs (such as links to them) into a single, preferred URL. This means that links from other sites to http://example.com/dresses/cocktail?gclid=ABCD get consolidated with links to https://www.example.com/dresses/green/greendress.html.
  • To simplify tracking metrics for a single product/topic. With a variety of URLs, it's more challenging to get consolidated metrics for a specific piece of content.
  • To manage syndicated content. If you syndicate your content for publication on other domains, you want to consolidate page ranking to your preferred URL.
  • To avoid spending crawling time on duplicate pages. You want Googlebot to get the most out of your site, so it's better for it to spend time crawling new (or updated) pages on your site, rather than crawling the desktop and mobile versions of the same pages.

Note from @lpearlman: it's all about not indexing the "duplicates". I do think that making a domain canonical is a good idea, but that can be done once per site in the Search Console; or, if we're already doing 301s, that will take care of it too.
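
For reference, a canonical URL is declared with a link rel="canonical" element in the head of each duplicate or variant page, pointing at the preferred URL. A minimal sketch (the URL below is a placeholder, not one of our actual record pages):

<head>
  <!-- hypothetical example: tell search engines which URL is the preferred one for this content -->
  <link rel="canonical" href="https://www.example.org/preferred-record-page" />
</head>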

References

Making Datasets Discoverable

Why work on Google Dataset Search?

Google Dataset Search relies on exposed, crawlable structured data via schema.org markup, using the schema.org Dataset class. It is a search engine over metadata. Working on making our pages crawlable by Google Dataset Search will not necessarily improve results in regular Google Search, as the two are not entirely related; it's mostly intended to simplify the research process for users. Results can be filtered based on the type of dataset required, such as tables, images or text, or on whether the dataset is free to use.

https://www.mikegingerich.com/blog/how-google-dataset-search-can-benefit-your-marketing-plan/

Unfortunately, usage stats for Google Dataset Search are not available, so we don't know who uses it or how many people actually use it every day.

The following links describe how to expose datasets and the structure in which the web crawler requires the web pages/datasets to be described:

Links to refer

Current visibility -

Our pages that are already registered with DataCite and show up in DataCite Search (example - https://search.datacite.org/works?query=Prostate+epithelial-specific+expression+of+activated+PI3K) also show up in Google Dataset Search - https://datasetsearch.research.google.com/search?query=Prostate%20epithelial-specific%20expression%20of%20activated%20PI3K&docid=L2cvMTFqOWJfdGNzNg%3D%3D

It shows up under the name and banner of DataCite, though the DOI link routes to GUDMAP.

Google's rich results test shows the valid dataset from DataCite and highlights missing fields - https://search.google.com/test/rich-results?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset&id=u5nHSzxhLFCnjuJsGuHV5w

Findings of Google Dataset investigation -

• The benefit of incorporating JSON-LD ourselves is that it's more legible: anybody can read and crawl the metadata since it is embedded in the HTML, we can easily see the metadata generated for a page using the dev tools, and no calls to external services like DataCite are needed (no external dependency). Example - https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/


Our pages would show up in the result, as opposed to DataCite showing up in the result, as would be the case here -

• The alternative way suggested here is to use DataCite - https://github.com/informatics-isi-edu/chaise/issues/1801

As per the report below, portals like Kaggle and Figshare show up in results more often – “Pick popular portals to publish than personal ones if the publisher is not consistent. For instance, sites like Kaggle will show up in results more than a one time published portal. The report stated that data repositories, such as Figshare, Zenodo, DataDryad, Kaggle Datasets and many others, are an excellent way to ensure dataset persistence. Many of these repositories have agreements with libraries to preserve data in perpetuity.”

Report - https://analyticsindiamag.com/google-dataset-search-analysis-report/

Paper - https://research.google/pubs/pub49385/

Which properties to include in the metadata?

Required properties - name, description https://developers.google.com/search/docs/data-types/dataset#dataset

Stats of which properties are most commonly included in the metadata (from Google's paper above).

List and detailed description of all properties - https://schema.org/Dataset

Sample structure -

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "#DOI link",
  "name": "#A descriptive name of a dataset",
  "description": "#A short summary describing a dataset\
                  MANDATORY ATTRIBUTES END HERE, REST ARE GOOD TO HAVE",

  "spatialCoverage": "#Only include this property if the dataset has a spatial dimension. \
                      For example, a single point where all the measurements were collected, or the coordinates of a bounding box for an area. \
                      It should be Text or Place",
  "temporalCoverage": "#The data in the dataset covers a specific time interval e.g. 1950-01-01/2013-12-18 or 2008",
  "identifier": "#Use the identifier property to attach any relevant Digital Object identifiers (DOIs) or Compact Identifiers. \
                 If the dataset has more than one identifier, repeat the identifier property or use an array. ",
  "license": "#A license under which the dataset is distributed",
  "isAccessibleForFree": "#Boolean to indicate whether the dataset is accessible for free",
  "sourceOrganization": "#The Organization on whose behalf the creator was working",
  "dateCreated": "#The date on which the CreativeWork was created or the item was added to a DataFeed",
  "datePublished": "#Date of first broadcast/publication",
  "dateModified": "#Date of last update",

  "publisher": {
     "@type":"Organization",
     "name":"#Name of publisher"
  },
  "provider": {
     "@type":"Organization",
     "name":"#Name of provider"
  },
  "funder": {
     "@type":"Organization",
     "name":"#Name of funder"
  },

  "creator": {
    "@type": "Person",
    "sameAs": "#To uniquely identify individuals, use ORCID ID as the value",
    "givenName": "#First name",
    "familyName": "#Last name",
    "name": "#Name"
  },
  "keywords": "#Keywords summarizing the dataset",
  "url": "#Location of a page describing the dataset",
  "sameAs": "The URL of a reference web page that unambiguously indicates the dataset's identity",
  "variableMeasured": "#The variable that this dataset measures. For example, temperature or pressure",
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "name": "#Datasets are often published in repositories that contain many other datasets. \
             The same dataset can be included in more than one such repository. \
             You can refer to a data catalog that this dataset belongs to by referencing it directly - https://schema.org/DataCatalog"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "#The link for the download"
  }
}
</script>

Example - https://developers.google.com/search/docs/data-types/dataset#example

JSON-LD

https://www.w3.org/ns/json-ld

Keywords are denoted with @ (for example @id, @type, etc.) - https://www.w3.org/TR/json-ld11/#syntax-tokens-and-keywords

Sample JSON-LD

URL - https://www.rebuildingakidney.org/chaise/record/#2/Common:Collection/RID=Q-3K5C

Datacite generated json-ld - https://api.datacite.org/dois/application/ld+json/10.25548/s5re-nvce

Google Dataset search result - https://datasetsearch.research.google.com/search?query=Histology%20of%20Human%20and%20Mouse%20embryonic%20and%20fetal%20kidney&docid=L2cvMTFqOWI3bDIwNQ%3D%3D

Below is the JSON-LD with, for each property, the potential variable to map from and its value noted:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "@id": "{{{Persistent_ID}}} - https://doi.org/10.25548/s5re-nvce",
  "name": "{{{Title}}} - Histology of Human and Mouse embryonic and fetal kidney",
  "description": "{{{Description}}} - A collection of histological sections of the developing human (CS13-week 23) and mouse (E11.5-P2) kidney. ",
  "keywords": "#Important attribute should have value like here https://api.datacite.org/dois/application/ld+json/10.25548/s5re-nvce, value and source pending",
  "url": "{{{window.location}}} - https://www.rebuildingakidney.org/chaise/record/#2/Common:Collection/RID=Q-3K5C",
  "datePublished": "{{{Release_Date/RCT}}} - 2017-09-07 15:01:21",
  "dateModified": "{{{RMT}}} - 2017-12-07 15:29:22",
  "variablesMeasured":"??",

  "publisher": {
    "@type":"Organization",
    "name":"{{{keyValues.Consortium}}} - GUDMAP"
  },
  "provider": {
    "@type":"Organization",
    "name":"{{{keyValues.$fkey_Common_Collection_Data_Provider_fkey.values.Name}}} - University of Southern California"
  },
  "creator":{
    "@type":"Person",
    "name":"{{{keyValues.$fkey_Common_Collection_Principal_Investigator_fkey.values.Full_Name}}} - Andrew McMahon"
  },

  "license": "??",
  "isAccessibleForFree": true,
  "includedInDataCatalog": {
    "@type": "DataCatalog",
    "url": "{{{window.location.hostname}}} - www.rebuildingakidney.org"
  }
}
</script>

• Another way suggested by a Google representative to inform search engines about our pages is using https://www.sitemaps.org to list all our datasets. It’s not a mandatory requirement but allows crawlers to find our datasets more easily.

https://developers.google.com/search/docs/advanced/sitemaps/overview
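
For reference, a sitemap is just an XML file listing the URLs we want crawlers to find. A minimal sketch with placeholder URLs (not our actual record pages) could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- hypothetical entries; one <url> element per dataset/record page -->
  <url>
    <loc>https://www.example.org/dataset/1</loc>
    <lastmod>2021-06-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.org/dataset/2</loc>
  </url>
</urlset>

The sitemap can then be submitted in Google Search Console, or referenced from robots.txt via a Sitemap: line.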

How to test?

The traditional way to test was the Structured Data Testing Tool, but it is on the way to deprecation (so not recommended) - https://search.google.com/structured-data/testing-tool/u/0/?utm_campaign=devsite&utm_medium=jsonld&utm_source=dataset

The new alternative is the Rich Results Test – https://developers.google.com/search/docs/guides/prototype https://search.google.com/test/rich-results

Example result - https://search.google.com/test/rich-results?id=e2vghHBltBW3iGCZjYsqXw

How to assess the validity of your markup, or the impact it’s having on site performance - https://www.schemaapp.com/schema-markup/know-schema-markup-working/


Questions

Google Search Console

Google Search Console is a platform that allows you to gain key organic search insights as well as a greater understanding of Google's indexing process. On the surface, Search Console allows us to optimize the content of our website to increase traffic, understand the device and location of our users, and fix navigation errors and crawling bugs.

It provides the last 16 months of data so we can see year over year growth as well.

Search Console offers tools and reports for the following activities:

* Confirmation that Google can locate and crawl our website.

* Fixing indexing problems and re-indexing requests for new or updated content.

* Analyzing Google Search traffic data for our website: how frequently our site shows up in Google Search, which search queries it is shown for, how often searchers click through for those queries, and much more.

* Receiving notifications when Google comes across indexing, spam, or other black hat concerns on our website.

* Showing the sites that link to our website.

* Troubleshooting AMP problems, mobile usability, and other Search features.

* It can flag, at both a high level and a low level, all the problems with the newly added dataset metadata. Once we've resolved any issues in the markup, we can click "Validate fix" to get Google to recrawl and revalidate.

* Google can issue a manual action against a site if one of its human reviewers determines that pages on the site are not compliant with Google's webmaster quality guidelines. Most of the issues reported can result in lower ranking or omission from search results. If a site is affected by a manual action, Google will notify us in the Manual Actions report and the Search Console message center.

* The mobile usability report lets us know about problems users may face while accessing and using the website on mobile devices.

* If there are pages on the website that we want to keep private, we can use Google Search Console to notify Google about links we don't want indexed, links we want hidden from search results, links we want to conceal now but index in the future, and URLs that contain thin content.

* The Crawl stats report tells us how often Google crawls our site, and when those crawls happen.

https://search.google.com/search-console/about

https://declan-kay117.medium.com/google-search-console-guide-for-2019-198c8ddc0357

https://www.schemaapp.com/schema-markup/how-to-measure-the-impact-of-structured-data/

How to prevent a site from getting indexed by search engines

This could be used for preventing staging and dev sites from getting indexed and making sure any experiment does not translate into a lasting result. To see if your dev or test site is indexed by Google, do a search for site:dev.yoursite.com in Google Search.

There are various ways to stop indexing as elaborated below:

  1. To block using a robots.txt file (minimal approach). You can edit and test your robots.txt using the robots.txt Tester tool. The disadvantage of this approach is the possibility of accidentally going live on prod with this robots.txt file, which blocks the entire site, like here.

    User-agent: *   
    Disallow: / 
    
  2. Using noindex

    The issue with a tag like this, though, is that you have to add it to each and every page. Also, crawlers that obey robots.txt won't see a noindex directive on a page if that page is disallowed for crawling: if the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it.

    <meta name="robots" content="noindex,nofollow">

  3. X-Robots-Tag

    To make the process of adding the meta robots tag to every single page of your site a bit easier, the search engines came up with the X-Robots-Tag HTTP header. This allows you to specify an HTTP header called X-Robots-Tag and set its value as you would the meta robots tag's value. The cool thing about this is that you can do it for an entire site. If your site is running on Apache, and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:

    Header set X-Robots-Tag "noindex, nofollow"

    And this would have the effect that the entire site can still be crawled, but would never be shown in the search results.

  4. HTTP authentication and/or IP restriction (Advised Approach)

    robots.txt and noindex instructions can always be ignored by crawlers, while HTTP authentication and/or IP restriction definitely keep them out.

    Many blogs recommend that staging environments should be protected with an HTTP username and password (HTTP authentication), or that access should be limited to certain IP addresses. Not only will this keep Google from ever seeing the site – it'll also keep real people (other than those you want, of course) from seeing it, too.
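
    As a sketch, HTTP Basic authentication on Apache could be configured in the staging site's .htaccess (assuming mod_auth_basic is enabled; the realm name and .htpasswd path below are placeholders):

    # hypothetical example: password-protect the whole staging site
    AuthType Basic
    AuthName "Staging"
    AuthUserFile /path/to/.htpasswd
    Require valid-user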
