Importer #55

Draft
wants to merge 51 commits into
base: main
5cf9c52
Importer
sergiucoman Oct 26, 2023
f39b66d
Added topics in other languages
sergiucoman Oct 26, 2023
c01028f
clean whitespaces. move metadata at the beginning. remove metadata image
andreituicu Nov 10, 2023
81109be
fix date format
andreituicu Nov 10, 2023
1596810
use AEM JSON rendition to extract tags and map them with the xcel sheet
andreituicu Nov 15, 2023
0d418c4
Crawling of topics and categories
sergiucoman Nov 16, 2023
ffedc45
Removed hero image
sergiucoman Nov 16, 2023
7bb9803
extract new trend tags
andreituicu Nov 17, 2023
8bf22d7
extract new trend tags
andreituicu Nov 17, 2023
3434bc7
extract new trend tags
andreituicu Nov 18, 2023
b882dea
process links
andreituicu Nov 19, 2023
bbe8bf2
import filter pages: topic, category, year
andreituicu Nov 20, 2023
51ae029
Extracted new category
sergiucoman Nov 20, 2023
e6ded05
import brightcove
andreituicu Nov 23, 2023
9df29ca
Added import for authors
sergiucoman Nov 24, 2023
ebc083e
add back hero-image
andreituicu Nov 24, 2023
358dafc
get raw urls from sitemap
andreituicu Nov 24, 2023
55ead3b
transform legacy section titles into h3s
andreituicu Nov 24, 2023
1576fef
107 - [Importer] Handle Youtube videos
andreituicu Nov 24, 2023
c05123a
import images with links and use the customer urls
andreituicu Nov 27, 2023
791c87b
fix - EDS URL output
andreituicu Nov 28, 2023
a851690
use a complete list of tags for import
andreituicu Nov 28, 2023
563361f
124 - Import links to subdomains of servicenow.com
andreituicu Nov 29, 2023
ee5967c
139 - Browser tab title suffix
andreituicu Nov 29, 2023
8d484cd
fix - linked images, partial brs
andreituicu Dec 2, 2023
000cec2
fix more spacing issues in import
andreituicu Dec 2, 2023
4934fd3
add exception
andreituicu Dec 3, 2023
aaf5eb7
fix more BR issues
andreituicu Dec 3, 2023
20e7c75
use columns block for authors
andreituicu Dec 4, 2023
e827448
fixes for pages and authors
andreituicu Dec 4, 2023
867f44c
fix - whitespaces
andreituicu Dec 5, 2023
6388d3d
fix - h6s
andreituicu Dec 5, 2023
463c1c7
fix - more h3s
andreituicu Dec 5, 2023
ac2673d
fix - whitespace issues
andreituicu Dec 5, 2023
ee32a02
fix - remove copyright variant
andreituicu Dec 7, 2023
a92dd5c
fix - filter page meta title
andreituicu Dec 7, 2023
128577a
fixes
andreituicu Dec 7, 2023
f4d587b
fixes
andreituicu Dec 8, 2023
a0bbfa5
fixes
andreituicu Dec 8, 2023
faa1fd5
fixes
andreituicu Dec 8, 2023
e46642d
fix - copyright translations
andreituicu Dec 9, 2023
e3ee791
fix - brightcove
andreituicu Dec 11, 2023
c3d4768
Added specifics for 2016-2019 blogs
sergiucoman Dec 13, 2023
6b5c00a
Auto-merged main into importer on deployment.
aem-code-sync[bot] Dec 13, 2023
e37e33b
include embed-video
sergiucoman Dec 13, 2023
c401046
Auto-merged main into importer on deployment.
aem-code-sync[bot] Dec 13, 2023
ae90459
Fixed import date
sergiucoman Dec 18, 2023
129c2f2
sitemap tools
florentin Dec 21, 2023
0e5753a
remove undeed urls from the diff output
andreituicu Dec 22, 2023
c3ae8f6
Added canonical url
sergiucoman Feb 8, 2024
5ad6e04
import - Canonical URLs
andreituicu Feb 15, 2024
34 changes: 33 additions & 1 deletion package-lock.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

16 changes: 9 additions & 7 deletions package.json
@@ -20,21 +20,23 @@
},
"homepage": "https://github.com/adobe/aem-boilerplate#readme",
"devDependencies": {
"@babel/core": "7.21.0",
"@babel/eslint-parser": "7.19.1",
"@esm-bundle/chai": "4.3.4-fix.0",
"@semantic-release/changelog": "6.0.3",
"@semantic-release/exec": "6.0.3",
"@semantic-release/git": "10.0.1",
"semantic-release": "21.0.5",
"@babel/core": "7.21.0",
"@babel/eslint-parser": "7.19.1",
"@web/test-runner": "0.15.1",
"@web/test-runner-commands": "0.6.5",
"chai": "4.3.7",
"eslint": "8.35.0",
"eslint-config-airbnb-base": "15.0.0",
"eslint-plugin-import": "2.27.5",
"@esm-bundle/chai": "4.3.4-fix.0",
"@web/test-runner": "0.15.1",
"@web/test-runner-commands": "0.6.5",
"semantic-release": "21.0.5",
"sinon": "15.0.1",
"stylelint": "15.2.0",
"stylelint-config-standard": "30.0.1"
"stylelint-config-standard": "30.0.1",
"xmlhttprequest": "1.8.0",
"xmldom": "0.6.0"
}
}
141 changes: 141 additions & 0 deletions tools/importer/import-author.js
@@ -0,0 +1,141 @@
/*
* Copyright 2023 Adobe. All rights reserved.
* This file is licensed to you under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. You may obtain a copy
* of the License at http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under
* the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
* OF ANY KIND, either express or implied. See the License for the specific language
* governing permissions and limitations under the License.
*/
/* global WebImporter */
/* eslint-disable no-console, class-methods-use-this */

const pageUrl = "https://main--servicenow--hlxsites.hlx.page";

Check failure on line 15 in tools/importer/import-author.js (GitHub Actions / build): Strings must use singlequote
const servicenowUrl = 'https://www.servicenow.com';

function isServiceNowLink(link) {
return (link.host.startsWith('localhost') || link.host.endsWith('servicenow.com'));

Check failure on line 19 in tools/importer/import-author.js (GitHub Actions / build): Expected indentation of 2 spaces but found 4
}

function isBlogLink(link) {

Check failure on line 22 in tools/importer/import-author.js (GitHub Actions / build): Block must not be padded by blank lines

return isServiceNowLink(link) &&

Check failure on line 24 in tools/importer/import-author.js (GitHub Actions / build): Expected indentation of 2 spaces but found 4
Check failure on line 24 in tools/importer/import-author.js (GitHub Actions / build): '&&' should be placed at the beginning of the line
(link.pathname.startsWith('/blogs')
|| link.pathname.startsWith('/fr/blogs')
|| link.pathname.startsWith('/de/blogs')
|| link.pathname.startsWith('/uk/blogs')
|| link.pathname.startsWith('/nk/blogs'));
}

function getServiceNowUrl(link) {
return new URL(new URL(link.href).pathname, servicenowUrl);

Check failure on line 33 in tools/importer/import-author.js (GitHub Actions / build): Expected indentation of 2 spaces but found 4
}

function getPageUrl(link) {
return new URL(new URL(link.href).pathname.replace('.html', ''), pageUrl);

Check failure on line 37 in tools/importer/import-author.js (GitHub Actions / build): Expected indentation of 2 spaces but found 4
}

const createMetadataBlock = (main, document, url) => {

Check failure on line 40 in tools/importer/import-author.js (GitHub Actions / build): 'url' is defined but never used
Check failure on line 40 in tools/importer/import-author.js (GitHub Actions / build): Block must not be padded by blank lines

const meta = {};

Check failure on line 42 in tools/importer/import-author.js (GitHub Actions / build): Expected indentation of 2 spaces but found 4

  // Title
  const title = document.querySelector('title');
  if (title) {
    // Trim before comparing so that trailing whitespace does not hide the suffix.
    let titleText = title.textContent.replace(/[\n\t]/gm, '').trim();
    const suffix = ' – ServiceNow Blog';
    if (titleText.endsWith(suffix)) {
      titleText = titleText.substring(0, titleText.length - suffix.length);
    }
    meta.Title = titleText;
  }

  // Description
  const desc = document.querySelector('[property="og:description"]');
  if (desc && desc.content.trim()) {
    meta.Description = desc.content.trim();
  }

  // Keywords
  const keywords = document.querySelector('meta[name="keywords"]');
  if (keywords && keywords.content) {
    meta.Keywords = keywords.content;
  }

  // Author page URL, read from og:url
  const author = document.querySelector('[property="og:url"]');
  let authorUrl = '';

  if (author && author.content.trim()) {
    authorUrl = author.content.trim();
  }
  // insert new hr element as last element of main
  const hr = document.createElement('hr');
  main.append(hr);

  const blogList = WebImporter.DOMUtils.createTable([['Blog List'], ['Author', authorUrl]], document);
  const metadataBlock = WebImporter.Blocks.getMetadataBlock(document, meta);

  main.append(blogList);
  main.append(metadataBlock);

  return meta;
};
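
The title clean-up in `createMetadataBlock` can be tried outside the importer. The following is a minimal sketch, not the PR's code: `cleanTitle` and `TITLE_SUFFIXES` are hypothetical names, and accepting both the en-dash and the hyphen suffix is an assumption based on the two variants that appear in this PR's importer files.

```javascript
// Hypothetical helper: the suffix list is an assumption covering both
// spellings used across the importer files in this PR.
const TITLE_SUFFIXES = [' – ServiceNow Blog', ' - ServiceNow Blog'];

function cleanTitle(rawTitle) {
  // Drop newlines and tabs, then trim so endsWith() sees the real ending.
  const title = rawTitle.replace(/[\n\t]/g, '').trim();
  const suffix = TITLE_SUFFIXES.find((s) => title.endsWith(s));
  return suffix ? title.slice(0, title.length - suffix.length) : title;
}

console.log(cleanTitle('\n\tJane Doe – ServiceNow Blog\n'));
```

Keeping the suffixes in one list would avoid the two importers drifting apart on which dash they expect.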

export default {
  /**
   * Apply DOM operations to the provided document and return
   * the root element to be then transformed to Markdown.
   * @param {HTMLDocument} document The document
   * @param {string} url The url of the page imported
   * @param {string} html The raw html (the document is cleaned up during preprocessing)
   * @param {object} params Object containing some parameters given by the import process.
   * @returns {HTMLElement} The root element to be transformed
   */
  transformDOM: ({
    // eslint-disable-next-line no-unused-vars
    document, url, html, params,
  }) => {
    const main = document.querySelector('body');

    console.debug(url);

    createMetadataBlock(main, document, url);

    // CLEANUP
    main.querySelectorAll('.legacyHTML, .servicenow-blog-header, .blog-author-info, .component-tag-path, .aem-GridColumn--default--4, .hero-image, .com-seperator, .servicenow-blog-list--block').forEach((el) => el.remove());

    main.querySelectorAll('br, nbsp').forEach((el) => el.remove());
    main.querySelectorAll('img[src^="/akam/13/pixel"]').forEach((el) => el.remove());

    // Rewrite ServiceNow links: blog links point to the new site, all others to www.servicenow.com.
    main.querySelectorAll('a').forEach((link) => {
      if (isServiceNowLink(link)) {
        if (isBlogLink(link)) {
          link.href = getPageUrl(link);
        } else {
          link.href = getServiceNowUrl(link);
        }
      }
    });

    return main;
  },

  /**
   * Return a path that describes the document being transformed (file name, nesting...).
   * The path is then used to create the corresponding Word document.
   * @param {HTMLDocument} document The document
   * @param {string} url The url of the page imported
   * @param {string} html The raw html (the document is cleaned up during preprocessing)
   * @param {object} params Object containing some parameters given by the import process.
   * @return {string} The path
   */
  generateDocumentPath: ({
    // eslint-disable-next-line no-unused-vars
    document, url, html, params,
  }) => WebImporter.FileUtils.sanitizePath(new URL(url).pathname.replace(/\.html$/, '').replace(/\/$/, '')),
};
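
The link rewriting in `import-author.js` can be checked without a DOM. The sketch below is an illustration under assumptions, not the PR's code: `rewriteHref` and `BLOG_PREFIXES` are hypothetical names, and the sample URLs are invented. The host check and the path prefixes are copied from the diff as they stand.

```javascript
const PAGE_URL = 'https://main--servicenow--hlxsites.hlx.page';
const SERVICENOW_URL = 'https://www.servicenow.com';
// Prefixes copied from isBlogLink() in the diff.
const BLOG_PREFIXES = ['/blogs', '/fr/blogs', '/de/blogs', '/uk/blogs', '/nk/blogs'];

// Hypothetical helper that combines isServiceNowLink, isBlogLink,
// getPageUrl and getServiceNowUrl into one string-to-string function.
function rewriteHref(href) {
  const link = new URL(href);
  const isServiceNow = link.host.startsWith('localhost') || link.host.endsWith('servicenow.com');
  if (!isServiceNow) {
    return href; // external links are left untouched
  }
  const isBlog = BLOG_PREFIXES.some((prefix) => link.pathname.startsWith(prefix));
  // Only the pathname is carried over, so query strings and fragments are dropped.
  const target = isBlog
    ? new URL(link.pathname.replace('.html', ''), PAGE_URL)
    : new URL(link.pathname, SERVICENOW_URL);
  return target.href;
}

console.log(rewriteHref('http://localhost:3001/blogs/2023/some-post.html?host=x'));
console.log(rewriteHref('https://www.servicenow.com/products/itsm.html#top'));
```

Two behaviours are worth knowing when reviewing: `endsWith('servicenow.com')` also matches hosts such as `notservicenow.com`, and query strings and fragments do not survive the rewrite.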
143 changes: 143 additions & 0 deletions tools/importer/import-filter-page.js
@@ -0,0 +1,143 @@
/*
* Copyright 2023 Adobe. All rights reserved.
* This file is licensed to you under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License. You may obtain a copy
* of the License at http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under
* the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR REPRESENTATIONS
* OF ANY KIND, either express or implied. See the License for the specific language
* governing permissions and limitations under the License.
*/
/* global WebImporter */
/* eslint-disable no-console, class-methods-use-this */

function fetchSync(method, url) {
  // Use the older synchronous XMLHttpRequest, as fetch seems to have problems in bulk import.
  const request = new XMLHttpRequest();
  request.open(method, url, false);
  request.overrideMimeType('application/json; charset=UTF-8');
  request.send(null);
  return {
    status: request.status,
    body: request.responseText,
  };
}

// Build the URL of the AEM JSON rendition for a page, e.g. page.html -> page.1.json
function jsonRenditionURL(url, depth = 1) {
  return url.replace('.html', `.${depth}.json`);
}

// Load the complete tag list; returns an empty object if the request fails.
function getAllTags() {
  const tagsURL = 'https://main--servicenow--hlxsites.hlx.live/blogs/tags.json';
  const response = fetchSync('GET', tagsURL);
  if (response.status === 200) {
    return JSON.parse(response.body);
  }
  return {};
}

const createMetadataBlock = (main, document, url) => {
  const meta = {};

  // Title
  const title = document.querySelector('title');
  if (title) {
    let titleText = title.textContent.replace(/[\n\t]/gm, '').trim();
    const suffix = ' - ServiceNow Blog';
    if (titleText.endsWith(suffix)) {
      titleText = titleText.substring(0, titleText.length - suffix.length);
    }
    meta.Title = titleText;
  }

  // Description
  const desc = document.querySelector('[property="og:description"]');
  if (desc && desc.content.trim()) {
    meta.Description = desc.content.trim();
  } else {
    meta.Description = 'Read about ServiceNow\'s Company News, Announcements, and Updates.';
  }

  // Keywords
  const keywords = document.querySelector('meta[name="keywords"]');
  if (keywords && keywords.content) {
    meta.Keywords = keywords.content;
  }

  const block = WebImporter.Blocks.getMetadataBlock(document, meta);
  main.append(block);

  return meta;
};

export default {
  /**
   * Apply DOM operations to the provided document and return
   * the root element to be then transformed to Markdown.
   * @param {HTMLDocument} document The document
   * @param {string} url The url of the page imported
   * @param {string} html The raw html (the document is cleaned up during preprocessing)
   * @param {object} params Object containing some parameters given by the import process.
   * @returns {HTMLElement} The root element to be transformed
   */
  transformDOM: ({
    // eslint-disable-next-line no-unused-vars
    document, url, html, params,
  }) => {
    const main = document.querySelector('body');

    // CLEANUP
    main.querySelectorAll('.legacyHTML, .servicenow-blog-header, .blog-author-info, .component-tag-path, .aem-GridColumn--default--4, .hero-image').forEach((el) => el.remove());
    // TODO is this ok?
    main.querySelectorAll('br, nbsp').forEach((el) => el.remove());
    main.querySelectorAll('img[src^="/akam/13/pixel"]').forEach((el) => el.remove());
    main.querySelectorAll('.servicenow-blog-list--block, .cmp-list').forEach((el) => el.remove());

    main.append(document.createElement('hr')); // create section

    // The tag of the page's list component is read from the AEM JSON rendition.
    const jsonRendition = JSON.parse(fetchSync('GET', jsonRenditionURL(url, 5)).body);
    // tags?.[0] keeps the lookup safe when the list component has no tags.
    const listTag = jsonRendition['jcr:content']?.root?.responsivegrid?.responsivegrid?.list?.tags?.[0]?.trim();

    // getAllTags() returns {} on failure, so the lookups below use optional chaining.
    const allTags = getAllTags();
    let blogList = null;
    if (url.toString().includes('/topics/')) {
      // topic page
      const newTrendTag = allTags.topic?.data?.find((tag) => tag['legacy-identifier-newtrend'] === listTag)?.identifier;
      blogList = WebImporter.DOMUtils.createTable([['Blog List'], ['New Trend', newTrendTag]], document);
    } else if (url.toString().includes('/category/')) {
      // category page
      const categoryTag = allTags.category?.data?.find((tag) => tag['legacy-identifier'] === listTag)?.identifier;
      blogList = WebImporter.DOMUtils.createTable([['Blog List'], ['Category', categoryTag]], document);
    } else if (params.originalURL.match(/\b\d{4}\.html$/)) {
      // year page
      const urlParts = params.originalURL.replace('.html', '').split('/');
      const year = urlParts[urlParts.length - 1];
      blogList = WebImporter.DOMUtils.createTable([['Blog List'], ['Year', year]], document);
    }

    // TODO author

    if (blogList) {
      main.append(blogList);
    }

    createMetadataBlock(main, document, url);
    return main;
  },

  /**
   * Return a path that describes the document being transformed (file name, nesting...).
   * The path is then used to create the corresponding Word document.
   * @param {HTMLDocument} document The document
   * @param {string} url The url of the page imported
   * @param {string} html The raw html (the document is cleaned up during preprocessing)
   * @param {object} params Object containing some parameters given by the import process.
   * @return {string} The path
   */
  generateDocumentPath: ({
    // eslint-disable-next-line no-unused-vars
    document, url, html, params,
  }) => WebImporter.FileUtils.sanitizePath(new URL(url).pathname.replace(/\.html$/, '').replace(/\/$/, '')),
};
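
The branching in `transformDOM` that picks the Blog List row can be pulled into a pure function and exercised without a network or a DOM. This is a sketch under assumptions: `blogListRow`, `firstListTag` and `sampleTags` are hypothetical, the tag entries are invented sample data, and only the field names (`legacy-identifier-newtrend`, `legacy-identifier`, `identifier`) and the rendition path are taken from the diff. The importer reads the year from `params.originalURL`, whereas the sketch uses a single `url` argument for brevity.

```javascript
// Hypothetical pure version of the topic / category / year decision.
function blogListRow(url, listTag, allTags) {
  if (url.includes('/topics/')) {
    const tag = allTags.topic?.data?.find((t) => t['legacy-identifier-newtrend'] === listTag);
    return ['New Trend', tag?.identifier];
  }
  if (url.includes('/category/')) {
    const tag = allTags.category?.data?.find((t) => t['legacy-identifier'] === listTag);
    return ['Category', tag?.identifier];
  }
  const year = url.match(/\b(\d{4})\.html$/);
  return year ? ['Year', year[1]] : null;
}

// Safe lookup of the first list tag; yields undefined when any level is missing.
function firstListTag(rendition) {
  return rendition['jcr:content']?.root?.responsivegrid?.responsivegrid?.list?.tags?.[0]?.trim();
}

// Invented sample data in the shape the diff expects from tags.json.
const sampleTags = {
  topic: { data: [{ identifier: 'ai', 'legacy-identifier-newtrend': 'sn:topic/ai' }] },
  category: { data: [{ identifier: 'culture', 'legacy-identifier': 'sn:category/culture' }] },
};

console.log(blogListRow('https://www.servicenow.com/blogs/topics/ai.html', 'sn:topic/ai', sampleTags));
console.log(blogListRow('https://www.servicenow.com/blogs/2023.html', undefined, sampleTags));
```

Separating the decision from the DOM calls would also make it straightforward to cover the pending author case with a test before wiring it into the importer.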