SiteMap
Some sites are quite transparent with their data. Documents called sitemaps
can give us some of the information we want without scraping for it.
The root domain will sometimes include robots.txt, a file that tells search engine crawlers which paths they are permitted to access. It can also list the site's sitemaps.
Example from target.com/robots.txt
Sitemap: https://www.target.com/sitemap_keywords-index.xml.gz
Sitemap: https://www.target.com/sitemap_stores-index.xml.gz
Sitemap: https://www.target.com/sitemap_taxonomy-categories-index.xml.gz
Sitemap: https://www.target.com/sitemap_pdp-index.xml.gz
Sitemap: https://www.target.com/sitemap_taxonomy-brand-index.xml.gz
Sitemap: https://www.target.com/sitemap_facet-categories-index.xml.gz
User-agent: *
Disallow: /*/Ntk
Disallow: /*/Ntt
Disallow: /*/Ntx
Disallow: /*%7Cd_
Disallow: /*/schoollist/
Disallow: /*BTWN
Disallow: /[path]/
Disallow: /7078046/
Disallow: /7079046/
Disallow: /AddToList
Disallow: /AddToRegistry
Disallow: /admin
Disallow: /advancedGiftRegistrySearchView
Disallow: /AjaxSearchNavigationView
Disallow: /Allons_voter
Disallow: /bp/c/
Disallow: /bp/guest_mfg_brand
Disallow: /bp/p/
Disallow: /CallToActionModalView
Disallow: /cgi-bin
Disallow: /cgi-local
Disallow: /Checkout
Disallow: /CheckoutEditItemsDisplayView
Disallow: /CheckoutOrderBillingView
Disallow: /CheckoutOrderShippingView
Disallow: /CheckoutSignInView
Disallow: /co-
Disallow: /common
Disallow: /coupons.
Disallow: /custom-reviews/
Disallow: /data
Disallow: /database/philboard.mdb
Disallow: /dir_on_server/
Disallow: /EmailCartView
Disallow: /EnlargedImageView
Disallow: /ESPDisplayOptionsViewCmd
Disallow: /ESPModal
Disallow: /ExitCheckoutCmd
Disallow: /FeaturedShowMoreOverlay
Disallow: /FetchProdRefreshContent
Disallow: /fiats
Disallow: /FiatsCmd
Disallow: /file
Disallow: /FreeGiftDisplayView
Disallow: /gam-
Disallow: /GenericRegistryPortalView
Disallow: /gc?k
Disallow: /GiftRegistrySearchViewCmd
Disallow: /gp/
Disallow: /GuestAsAnonymous
Disallow: /guestEmailNotificationView
Disallow: /HelpContent
Disallow: /igp
Disallow: /index.jhtml
Disallow: /keyword=
Disallow: /legal-contact-us/
Disallow: /list.id=1
Disallow: /LogonForm
Disallow: /m/
Disallow: /ManageOrder
Disallow: /ManageReturns
Disallow: /MediaDisplayView
Disallow: /mm/
Disallow: /moreinfo.cfm
Disallow: /news
Disallow: /np/
Disallow: /OpenZoomLayer
Disallow: /OrderItemDisplay
Disallow: /OtherDisplayView
Disallow: /p/premium-registry
Disallow: /PhotoUpload
Disallow: /pl/
Disallow: /ProductComparisonCmd
Disallow: /ProductDetailsTabView
Disallow: /PromotionDetailsDisplayView
Disallow: /PromotionDisplayView
Disallow: /qi/
Disallow: /QuickInfoView
Disallow: /ready_sit_read/index.jhtml
Disallow: /RegistryPortalCmd
Disallow: /ReportAbuse
Disallow: /reviewVote
Disallow: /script
Disallow: /SearchNavigationView
Disallow: /shop/
Disallow: /SingleShipmentOrderSummaryView
Disallow: /SOImapPriceDisplayView
Disallow: /SpecificationDefinitionView
Disallow: /splitOrderItem
Disallow: /store-locator/search-results-print
Disallow: /supertarget/index.jhtml
Disallow: /target_baby/
Disallow: /target_group
Disallow: /targetdirect_group/
Disallow: /TargetListPortalView
Disallow: /TargetStoreLocatorCmd
Disallow: /tdir/p/kids-back-to-school/
Disallow: /tsa/
Disallow: /VariationSelectionView
Disallow: /webapp
Disallow: /winnt/
Disallow: /WriteComments
Disallow: /WriteReviews
Disallow: /XCSA/
Disallow: /yr
Disallow: /s?
Disallow: /cart
Disallow: /account/
Disallow: /tracking
Disallow: /config
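The Sitemap lines in a robots.txt file like the one above can be pulled out with a few lines of code. This is a minimal Python sketch, not part of this repo (whose scripts are PowerShell). The function name sitemap_urls_from_robots is made up for illustration, and the sample text is an abbreviated copy of the example above.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" survives.
        key, sep, value = line.partition(":")
        # Field names in robots.txt are matched case-insensitively.
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


sample = """\
Sitemap: https://www.target.com/sitemap_stores-index.xml.gz
Sitemap: https://www.target.com/sitemap_pdp-index.xml.gz
User-agent: *
Disallow: /admin
"""

for url in sitemap_urls_from_robots(sample):
    print(url)
```

Printing one URL per line keeps the output easy to feed into other tools, such as xargs.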
This isn't fully baked yet, but each sitemap appears to be a nested set of other sitemaps, and they seem to follow two schemas. I've added get-urlFromSiteMaps and get-urlFromUrlSet to accommodate the two schemas. I envision being able to pipe sitemap content to scripts in order to get current data.
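For anyone who would rather not install pwsh, the same idea can be sketched with Python's standard library. This is an alternative under stated assumptions, not a port of get-urlFromSiteMaps or get-urlFromUrlSet: it assumes the two schemas are the sitemapindex and urlset documents defined by the sitemaps.org protocol. The function names and the example.com URLs are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace the sitemaps.org protocol uses for both document types.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_bytes(data):
    """Decompress gzip data (.xml.gz); return anything else unchanged."""
    # gzip streams start with the magic bytes 0x1f 0x8b.
    return gzip.decompress(data) if data[:2] == b"\x1f\x8b" else data


def locs_from_sitemap(data):
    """Return (kind, urls), where kind is 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(read_sitemap_bytes(data))
    kind = root.tag.rsplit("}", 1)[-1]  # strip the '{namespace}' prefix
    if kind == "sitemapindex":
        path = "sm:sitemap/sm:loc"  # index: entries point at child sitemaps
    elif kind == "urlset":
        path = "sm:url/sm:loc"  # url set: entries point at actual pages
    else:
        raise ValueError("unexpected root element: " + kind)
    return kind, [el.text.strip() for el in root.findall(path, NS) if el.text]


# Made-up sample data standing in for a downloaded sitemap index.
index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_a.xml.gz</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_b.xml.gz</loc></sitemap>
</sitemapindex>"""

kind, urls = locs_from_sitemap(gzip.compress(index_xml))
print(kind)
for url in urls:
    print(url)
```

Returning the document kind alongside the URLs lets a caller decide whether to fetch the next level of sitemaps or treat the URLs as final pages.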
Note
I went with PowerShell so the scripts can be used on both Windows and Linux, as I found it easier to install pwsh on Linux than to find an XML library I liked in Bash. I'm totally open to an alternative approach, but I'm not interested in figuring it out.
If you were to curl the sitemaps to files, an inline script could look like:
./get-urlFromSiteMaps sitemap_pdp-index.xml.gz | xargs ./download-sitemap.sh | xargs wget -P pdp_sitemap/
Please update me as you learn more!