The storage pipeline has been developed with the following in mind.
SPO Storage Analysis: The storage analysis pipeline delivers an extensive SharePoint Online (SPO) storage analysis, ensuring all storage usage is accounted for, with detailed insights into individual site collections, document libraries, and file sizes. The analysis also covers trends in storage consumption over time, identifies unused or duplicate files, and provides recommendations for optimizing storage allocation and usage efficiency.
Delta Pulls for Efficient Cost Management: The tool supports delta pulls, retrieving only the changes since the last data pull. This reduces MGDC costs and keeps the data up to date without unnecessary processing.
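As a rough illustration of how the two pull types differ, the sketch below shows how the StartTime and EndTime parameters could be chosen for each run. The helper is hypothetical and assumes you record the EndTime of the previous successful run yourself; the full-pull convention of identical dates is described later in this guide.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

def pull_window(last_end_time: Optional[datetime]) -> Tuple[datetime, datetime]:
    """Return (StartTime, EndTime) for the next pipeline run.

    Full pull: StartTime and EndTime are the same date (a full snapshot).
    Delta pull: StartTime is the EndTime of the previous successful run,
    so MGDC only returns what changed since that pull.
    """
    now = datetime.now(timezone.utc)
    if last_end_time is None:
        return now, now              # first run: full snapshot
    return last_end_time, now        # subsequent runs: delta since last pull
```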
Please run the Storage Forecast pipeline before running this one. This ensures you have full visibility of all MGDC costs associated with the solution before you execute it.
Further details can be found here.
Log in to your Synapse Studio and import the pipeline.
- Download the StorageExploration_LowCode.zip
- From the Home menu, navigate to Integrate.
- Import the pipeline from the + button. Browse to the downloaded pipeline template.
- Select your Linked Services (created following Jose's blog) and click Open Pipeline. This will import 1 pipeline, 4 datasets and 1 notebook into your Synapse Studio.
- Click Publish all > Publish
Great, you are now ready to execute the pipeline and obtain a full snapshot. A full snapshot requires a full pull, which means providing the same start and end date to the pipeline.
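The steps below trigger the run from Synapse Studio. If you would rather trigger and monitor the same full pull from code, here is a minimal sketch using the azure-synapse-artifacts and azure-identity packages. The workspace endpoint, pipeline name, container name and date format are placeholders you will need to adjust; the parameter names (StartTime, EndTime, StorageContainerName) are the ones used later in this guide.

```python
from azure.identity import DefaultAzureCredential
from azure.synapse.artifacts import ArtifactsClient

# Placeholders - replace with your workspace, imported pipeline name and container.
endpoint = "https://<your-workspace>.dev.azuresynapse.net"
pipeline_name = "<imported pipeline name>"
snapshot_date = "2024-07-01T00:00:00Z"  # full pull: StartTime == EndTime

client = ArtifactsClient(endpoint=endpoint, credential=DefaultAzureCredential())

# Trigger the run with full pull parameters (same start and end date).
run = client.pipeline.create_pipeline_run(
    pipeline_name,
    parameters={
        "StartTime": snapshot_date,
        "EndTime": snapshot_date,
        "StorageContainerName": "<your-container>",
    },
)
print(f"Started run {run.run_id}")

# Check the status (the same information shown on the Monitor tab).
status = client.pipeline_run.get_pipeline_run(run.run_id).status
print(f"Run status: {status}")
```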
- Navigate to the Integrate menu and select the imported StorageExploration_LowCode pipeline. Click Add Trigger > Trigger now.
- Populate the full pull parameters and click OK (StartTime and EndTime the same).
- Navigate to the Monitor tab to see the execution details. Wait for the pipeline to Complete; this typically takes around 25 minutes.
- Once complete we can check the storage account for the extracted data.
- If your pipeline failed, please check the Troubleshooting section.
- If the pipeline Succeeded, navigate to your storage account in the Azure Portal and check the data lake. Open the Containers blade and open the container you provided for the StorageContainerName parameter of the pipeline run.
There should be three folders in the root of the container. A fourth deleted folder will be added when you perform a delta pull.
latest - this holds the latest version of the processed dataset. PowerBI is hooked up to this folder. More on this later.
/ latest / permissions / *.json
/ latest / datasetname / *.json
raw - this holds all the data that is pulled from MGDC in its unprocessed form.
/ raw / permissions / 2024 / 07 / 01 / 13ff9026-d8e7-452d-8369-675ad3793842 / *.json
/ raw / datasetname / YYYY / MM / DD / RunId / *.json
temp - this holds a copy of the latest folder that is used for merging with deltas in future runs.
/ temp / permissions / *.json
/ temp / datasetname / *.json
Note
The dates used in the folder hierarchy come from the EndTime parameter.
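For reference, a small hypothetical helper that reproduces the raw folder layout above. It is not part of the pipeline; it only illustrates how the path is composed from the dataset name, the EndTime date and the pipeline run ID.

```python
from datetime import datetime

def raw_path(dataset: str, end_time: datetime, run_id: str) -> str:
    """Build the raw folder prefix for a given run.

    The date segments come from the EndTime parameter, per the note above.
    """
    return f"raw/{dataset}/{end_time:%Y/%m/%d}/{run_id}/"

# e.g. raw/permissions/2024/07/01/13ff9026-d8e7-452d-8369-675ad3793842/
print(raw_path("permissions", datetime(2024, 7, 1),
               "13ff9026-d8e7-452d-8369-675ad3793842"))
```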
- Check that you have data in the latest folder for each dataset. If so, we can attempt to hook up the PowerBI template; a quick programmatic check is sketched below.
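If you prefer to verify this from code rather than clicking through the portal, here is a quick sketch using the azure-storage-blob package. The account URL, container name and dataset list are placeholders; the container is the one you passed as StorageContainerName.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Placeholders - replace with your storage account and container.
account_url = "https://<your-storage-account>.blob.core.windows.net"
container_name = "<StorageContainerName value>"
datasets = ["permissions"]  # add the other dataset folder names your pipeline writes

service = BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
container = service.get_container_client(container_name)

for dataset in datasets:
    blobs = list(container.list_blobs(name_starts_with=f"latest/{dataset}/"))
    print(f"latest/{dataset}: {len(blobs)} blob(s)")
```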
TODO: Rework the PowerBI section - include the PowerBI hook-up and the delta pull walkthrough, then link to scheduling.