-
Notifications
You must be signed in to change notification settings - Fork 0
Data preparation ‐ New datasets
Datasets consist of a set of statistics and their associated metadata. The dataset is built in SW3 as a data cube from a data table and reference, or lookup, data.
For new datasets you need to have a data table. The table should be in CSV (UTF-8) format and contain columns for:
- data values (including rows for any calculated totals or averages you want to include)
- all relevant dimensions and reference codes
- measure or data types
- standardised shorthand note codes to label specific data values - you must include this column, even if you do not have any note codes to include
In this guidance, we’ll be using the example scenario of data for council tax bands. Here are example rows of a data table for this scenario:
AreaCode | YearCode | BandCode | NoteCodes | Measure | Data |
---|---|---|---|---|---|
512 | 199697 | A- | 1 | 256.88 | |
512 | 199697 | A | x | ||
512 | 199697 | B | 1 | 359.64 | |
512 | 199697 | C | e | 1 | 411.01 |
512 | 199697 | D | 1 | 462.39 |
The CSV file should be in UTF-8 format and have:
- values separated by commas
- headings in the first row
- " " used for quote marks, if needed
- columns in any order
- data values as numeric values only
Column headings should have meaningful names, so that you know what each column contains. This will be important when uploading your CSV into SW3. They can have spaces between words if required.
You should include rows in your data table for any totals or averages you want to include with the dataset. The rows should contain:
- totals or averages
- correct references codes or date formats
- a standard note code indicating whether the data value is a total (t) or an average (a)
In our council tax example, the AreaCode dimension refers to local authorities. If you wanted to include a total for the whole of Wales, you would need to include a row containing the total, the reference code for Wales [596] and a ‘t’ note code. If the value for Wales was an average, this would be an ‘a’ note code.
If your data contains hierarchies with multiple layers - for example economic regions, which are groupings of local authorities - you can provide multiple totals or averages. For example, a total for each economic region and a total for Wales.
When you build your dataset in SW3, you’ll need to select reference data for each dimension in your data table. In our example table, this would mean selecting lookup tables for ‘AreaCode’ and ‘BandCode’, and confirming date formats for 'YearCode'. You do not need to add lookup tables for data value notes.
SW3 has standardised, centrally-managed lookup tables for:
- age
- ethnicity
- geography
- religion
- sex and gender
SW3 does not use lookup tables for dimensions containing dates, instead taking a standardised approach to formatting.
Using standardised reference data for these data types:
- ensures better consistency across StatsWales
- standardises English and Welsh descriptions
- improves usability for data consumers
You can access the standardised lookup tables [link TBD] at any time. You must use the reference codes that appear in these lookup tables in your data table. This will ensure that the system can build your dataset effectively and with less chance of errors.
SW3 contains standardised lookup tables and hierarchies for:
- Agricultural regions
- Built-up areas
- Economic regions
- Education consortia
- Electoral wards
- ITL (International Territorial Levels) regions
- Fire and rescue authorities
- GP practices
- Hospitals
- National parks
- Local authorities
- Local authorities divided into areas inside and outside national parks
- Local health boards
- Primary care clusters
- Police authorities
- Rail stations
- Schools
- Senedd constituencies
- Senedd electoral regions
- Super output areas (MSOAs and LSOAs)
- UK regions
- Westminster constituencies
Local authorities can be grouped into the following hierarchies:
- Agricultural regions
- City regions
- Economic regions
- Education consortia
- Fire and rescue authorities
- Local health boards
- Police authorities
If your data involves any of these hierarchies, you need to include values for all hierarchical groupings. These will be either totals or averages. For example, the values for each economic region will be a total of all the local authorities in that region.
SW3 can only accept certain date formats in your data table. When you build your dataset in SW3, for dimensions containing dates you’ll be asked about the date formats you’ve used.
The dates that will appear on the consumer website, that is the English and Welsh descriptions, are standardised and built into the system.
For example, months, quarters or years for which data values apply to.
Format in data table | Example | How it will appear on website |
---|---|---|
YYYYYYYY | 20222023 | 2022-23 |
YYYY-YYYY | 2022-2023 | 2022-23 |
YYYY/YYYY | 2022/2023 | 2022/23 |
YYYYYY | 202223 | 2022-23 |
YYYY-YY | 2022-23 | 2022-23 |
YYYY/YY | 2022/23 | 2022/23 |
YYYY | 2022 | 2022 |
Any of the year formats followed by a quarter code.
Quarter format in data table | Example | How it will appear on website |
---|---|---|
Qx | 2022Q1 | Q1 2022 |
_Qx | 2022_Q1 | Q1 2022 |
-Qx | 2022-Q1 | Q1 2022 |
x | 20221 | Q1 2022 |
_x | 2022_1 | Q1 2022 |
-x | 2022-1 | Q1 2022 |
Any of the year formats followed by a month code.
Month format in data table | Example | How it will appear on website |
---|---|---|
MMM | 2022Jan | January 2022 |
mMM | 2022m01 | January 2022 |
MM | 202201 | January 2022 |
You also need to know the type of year the dimension covers, either:
- Calendar (1 January to 31 December)
- Meteorological (1st March to 28th or 29th February)
- Financial (1 April to 31 March)
- Tax (6 April to 5 April)
- Academic (1 September to 31 August)
Your dataset can contain data for multiple time periods. For example, monthly values with quarterly and yearly totals.
You must use consistent year formatting for all time periods present in the dataset. For example if you use ‘YYYY’ for years, you could use ‘YYYYQx’ for quarters and 'YYYYMM' for months.
For example, specific dates when data values were collected.
Format in data table | Example | How it will appear on website |
---|---|---|
DD/MM/YYYY | 01/01/2022 | 1 January 2022 |
DD-MM-YYYY | 01-01-2022 | 1 January 2022 |
YYYY-MM-DD | 2022-01-01 | 1 January 2022 |
YYYYMMDD | 20220101 | 1 January 2022 |
For dimensions not containing dates or any of the standardised reference data, you should prepare your own lookup tables, which you will upload to SW3. In our example, this would be the case for the ‘BandCode’ dimension, which uses domain-specific reference codes for council tax bands.
Your lookup table should be in csv format and in one of the following formats.
Heading | What the column contains |
---|---|
refcode | Reference codes |
language | Language used, either 'en' or 'cy' |
description | Description |
sort_order | Sort order |
notes | Notes |
Example part of lookup table for tax band code:
refcode | language | description | sort_order | notes |
---|---|---|---|---|
A- | en | A- | 1 | |
A- | cy | A- | 1 | |
A | en | A | 2 | |
A | cy | A | 2 | |
B | en | B | 3 |
Heading | What the column contains |
---|---|
refcode | Reference codes |
description_en | Description (English) |
description_cy | Description (Welsh) |
sort_order | Sort order |
notes_en | Notes (English) |
notes_cy | Notes (Welsh) |
Example part of lookup table for tax band code:
refcode | description_en | description_cy | sort_order | notes_en | notes_cy |
---|---|---|---|---|---|
A- | A- | A- | 1 | ||
A | A | A | 2 | ||
B | B | B | 3 | ||
C | C | C | 4 | ||
D | D | D | 5 |
It is important these specific column headings are used.
Lookup table notes can be provided, but are not currently shown in the consumer view in SW3. If notes contain important information, you should ensure this information is also provided in the most appropriate metadata section.
When you add reference data in SW3, you’ll also need to add what you want the dimension to be called on the consumer side. The name should be:
- concise
- clearly explain what the dimension contains
- in sentence case, except for proper nouns, for example 'Local authorities'
A measure or data type indicates what the data value represents. You must include a column for this in your data table, even if there is only one type of measure or data value in your dataset. Including the measure or data type helps consumers better understand the data in your dataset.
For a dimension containing measure or data types, you should prepare your own lookup table, which you will upload to SW3. It should be in csv format and contain the following columns and contents:
Heading | What the column contains |
---|---|
refcode | Reference codes |
measure | Any of the following:
|
format | Any of the following:
|
sort_order | Sort order |
description_en | Description in English of what the measure actually represents, such as ‘House price as a percentage of median salary’ |
description_cy | Description in Welsh |
Example measure or data type lookup table for our council tax example:
refcode | measure | format | sort_order | description_en | description_cy |
---|---|---|---|---|---|
1 | Numerical value | Decimal | 1 | Council tax in £ | Treth cyngor mewn £ |
2 | Percentage | Percentage | 2 | % of median monthly salary | % o ganolrif cyflog misol |
3 | Rank | Integer | 3 | Rank of council tax band | Safle band treth gyngor |
Please note this example is for demonstration purposes only and is not a genuine lookup table for council tax data.
You should only add a note to a specific data value if it's necessary to understand the data. However, you must include a column for note codes in your data table, even if you do not have any note codes to include.
SW3 uses shorthand note codes, with standardised explanations, closely following public sector standards. These are as follows:
Shorthand (Not case sensitive) |
Meaning | Use |
---|---|---|
a | Average | When a data value is an average of other values |
c | Confidential information | For example, if a data value has been suppressed because you could identify details about specific people from the data. Note that there is no way in SW3 to mask or suppress values, so this must be done in your data table. |
e | Estimated | When a data value is an estimated value |
f | Forecast | When a data value is a calculated future value instead of an observed value |
k | Low figure | A low figure that appears as a zero when rounded |
p | Provisional | When a data value is yet to be finalised, or is expected to be revised |
r | Revised | When a data value has been revised since it was first published |
t | Total | When a data value is a total of other values |
u | Low reliability | When a data value is of low statistical quality |
x | Missing data | For example, where a data value is not collected in a region |
z | Not applicable | For example, in tables of employment where people under 16 cannot legally be employed |
If you need to provide any custom explanations to clarify any note codes used, these should be provided in the most appropriate metadata section.
If you need to add multiple note codes to a single data value, these should be comma separated, for example ‘p,f’.
All other notes about dimensions or the dataset should be provided in the most appropriate metadata section.
Before you create a new dataset in SW3, you should prepare all the related metadata you need. You should provide this all in the same language. This should be whichever language you’ll use when using SW3, either English or Welsh. Translations of metadata into the other language are processed with all other translations.
The following is all the metadata you’ll need to provide.
- The title that will appear with the dataset on the StatsWales website
- Should be short, descriptive and unique
- The system will tell you if you enter a title that is already in use by a live, published dataset
- In short, simple sentences explain what this dataset is about and what it shows
- Describe which dimensions have been used and, if needed, why they’ve been used
- This should ideally not be more than 2-3 paragraphs long
- Anything related to data collection or data quality should not be included, and be kept to their respective sections
In short, simple sentences explain either:
- the methodology used to collect the data
- the methodology used to calculate the data
- both of the above
In short, simple sentences explain any issues or methodological changes related to the dataset including any:
- notes relevant to all data values in the dataset
- custom explanations needed to clarify any data value note codes used
If available, you can include a high-level summary of the statistical quality report. You should add a link to the report in the ‘Related reports’ section, not as a link here.
There is also a subsection within this section to indicate if any rounding has been applied, and if it has, explain it in short, simple sentences.
This is split into data providers, such as ‘Office for National Statistics (ONS)’, and sources from those data providers, such as ‘2021 Census’. In some instances, you may only know the data provider and not a specific source from that provider, for example ‘Health Behaviours in School-aged Children (HBSC)’.
You can add multiple sources if required.
The list of data providers and sources [link TBD] is centrally managed. If you need to have a data provider or source added to the list, you need to:
- email ... at ... [process TBD]
- provide the name of the data provider and source in both English and Welsh
You’ll be notified once the data provider and source have been added to the list. You’ll then be able to select them in SW3.
You should provide links to any related reports, which may include:
- GOV.WALES statistics and research releases
- statistical quality reports
- other related work
You need to provide the URL of the report, and the link text that will appear on the webpage. The link text should be the title or a short description of the report, not the URL.
If there are different URLs for the report for different languages, you should provide both URLs. The link text should clearly state what language the linked report is in. For example you might add these two links:
- GOV.WALES statistics and research releases (English)
- GOV.WALES statistics and research releases (Welsh)
Indicate whether the dataset will be regularly updated or not - either because it’s one-off data or it’s stopped being updated. You have the option to provide the period between updates in days, weeks, months, quarters or years.
There are different types of official statistics designations. The definition of each of these are:
Designation | What it means |
---|---|
Official statistics | Statistics produced by Crown bodies and other organisations listed within an Official Statistics Order, on behalf of the UK government or devolved administrations |
Accredited official statistics (formerly National Statistics) | Official statistics that the Office for Statistics Regulation (OSR) has independently reviewed and confirmed comply with the standards of trustworthiness, quality and value |
Official statistics in development (formerly experimental statistics) | Official statistics that are undergoing a development; they may be new or existing statistics, and will be tested with users, in line with the standards of trustworthiness, quality and value |
No designation | Any other statistics |
There are 13 high-level topics, most of which have multiple secondary topics. A single topic tag consists of ‘high-level + secondary topic’. The only exceptions are the ‘Tourism’ and ‘Welsh language’ topic tags which are high level only.
For example a healthcare related dataset may be tagged to a single topic tag ‘Environment, energy and agriculture: Farming’. Or it could also be tagged to a second topic tag of ‘Business, economy and labour market: Business’. You should select as many topics as are relevant to the dataset.
This selection will allow consumers to more easily locate and identify datasets of interest to them. It’s important to consider the different ways consumers may categorise what a dataset covers.
Business, economy and labour market
- Business
- Economic indices
- Employment
- Gross disposable household income
- Gross value added
- Research and development
Crime, fire and rescue
- Crime and justice
- Fire and rescue incidents
- Fire services
- Police
Education and training
- Apprenticeships
- Further education
- Higher education
- Lifelong learning
- Pupils
- Schools
- Student support
- Teachers and support staff
- Youth work
Environment, energy and agriculture
- Air quality
- Energy
- Farming
- Flooding
- Fly-tipping
- Greenhouse gases
- Land
- Waste management
Finance and tax
- Capital
- Council tax
- Land transaction tax
- Landfill disposals tax
- Non-domestic rates (business rates)
- Revenue
- Settlement
Health and social care
- Ambulance services
- Children
- Coronavirus (COVID-19)
- Dental services
- Health finance
- Hospital and other care setting activity
- General practice
- Maternity, births and conceptions
- Mental health
- Performance and waiting times
- Primary care and community services
- Staff
- Social care and day care
- Social services
- Substance misuse
Housing
- Affordable housing
- Demolitions
- Disabled facilities grants
- Hazards, licences and housing quality
- Help to buy
- Homelessness
- Household numbers
- Housing improvement
- Housing stock and need
- New house building
- Possessions and evictions
- Private sector rents
- Social housing
- Traveller caravan count
People, identity and equality
- Age
- Disability
- Ethnicity
- Gender identity
- Language
- Marital status
- Migration
- National identity
- Population
- Pregnancy and maternity
- Religion
- Sex
- Sexual orientation
- Socio-economic status
Poverty
- Communities First
- Deprivation
- Discretionary assistance fund
- Income poverty
Tourism
Transport
- Air
- Rail
- Road
- Sea
Welsh and local government
- Electoral register
- Government workforce
- Local authority performance
- Senedd constituencies
Welsh language