Commit

repost
Ian Turton committed Jul 16, 2024
1 parent 1f1d381 commit 44767bf
Showing 3 changed files with 206 additions and 0 deletions.
98 changes: 98 additions & 0 deletions _posts/2023-11-11-geojson.md
@@ -0,0 +1,98 @@
---
layout: post
title: Is GeoJSON a spatial data format?
date: 2023-11-11
categories: gis
---
# Is GeoJSON a good spatial data format?

A few days ago on Mastodon [Eli Pousson](https://fosstodon.org/@[email protected])
asked:

> Can anyone suggest examples of files that can contain location info but aren't often considered spatial data
> file formats?
>
He suggested EXIF, [Iván Sánchez Ortega](@[email protected] )
followed up with spreadsheets, and being devilish I said GeoJSON.

This led to more discussion, with people asking why I thought that, so instead of being flippant I thought
about it. This blog post is the result of those thoughts, which I thought were kind of obvious but, from things
people have said since, maybe aren't.

I've been a developer for most of my career, so my main interest in a spatial data format is that:

1. it stores my spatial data as I want it to,
2. it's fast to read and, to a lesser extent, write,
3. it's easy to manage.

One seems obvious: if I store a point and then ask for it back, I want to get that point back (to the limit
of the precision of the processor's floating point). If a format can't manage that then please don't use it.
This failure is not common, but Excel comes to mind as a program that takes good data and trashes it. If it
isn't changing [gene names into
dates](https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates) then
it's [reordering the dbf file to destroy your
shapefile](https://gis.stackexchange.com/questions/132359/how-is-attribute-data-in-dbf-file-tied-to-shapefile-location-data-in-shp-file).
GeoJSON can also fail at this, as the standard says that I must store the data in WGS 84 (lon/lat). That is
fine if my data is already in WGS 84, but suppose I have some high quality OSGB data that is carefully
surveyed to fractions of a millimetre. The underlying code does a conversion to WGS 84 in the background, and
further the developer wanted to save space and limited the number of decimal places to, say, 6 (OK, [that was
me](https://osgeo-org.atlassian.net/browse/GEOT-6650)). When the data gets converted back to OSGB I'm looking
at errors of centimetres (or worse), but given the vagaries of floating point representation I may not be
able to tell.
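
To put a rough number on that, here is a minimal sketch of how far a lon/lat point can move when its
coordinates are rounded to a fixed number of decimal places. It assumes a spherical Earth with a mean radius of
6,371 km, and the function names are my own for illustration, so treat the result as an order of magnitude
rather than a survey-grade figure.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def rounding_error_m(lat, lon, places=6):
    """Approximate ground distance (m) between a lon/lat point and its rounded copy."""
    dlat = math.radians(round(lat, places) - lat)
    dlon = math.radians(round(lon, places) - lon)
    north = dlat * EARTH_RADIUS_M
    # A degree of longitude shrinks with the cosine of the latitude.
    east = dlon * EARTH_RADIUS_M * math.cos(math.radians(lat))
    return math.hypot(north, east)


def worst_case_error_m(lat, places=6):
    """Upper bound at a latitude: half a unit in the last place on both axes."""
    half_step = math.radians(0.5 * 10 ** -places)
    north = half_step * EARTH_RADIUS_M
    east = north * math.cos(math.radians(lat))
    return math.hypot(north, east)


if __name__ == "__main__":
    # Roughly the latitude of Glasgow, at 6 decimal places.
    print(f"{worst_case_error_m(55.86, places=6):.3f} m")
```

At 6 decimal places the bound comes out at a few centimetres, which is harmless for a web map and ruinous for
millimetre-level survey data.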

Two comes from being a GeoServer developer: a largish chunk of the time taken to draw a web map (or stream
out a WFS file) is taken up by reading the data from the disk. Much of the rest of the time is spent converting
the data into a form that we can draw. Ideally, we only want to read in the features needed for the map the
user has requested (actually, ideally we want to **not** read in most of the data by having it already be in
the cache, but that is hard to do). So we like indexed datasets: both spatial indexes and attribute indexes
can substantially speed up map drawing. As the size of spatial datasets increases, the time taken to fetch the
next feature from the store becomes more and more important. An index allows the program to skip to the
correct place in the file for either a specific feature or for features that are in a specific place or
contain a certain attribute with the requested value. This is a great time saver: imagine trying to look
something up in a big book by using the index compared to paging through it reading each page in turn.
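
As an illustration of the idea (and not of how GeoServer or any particular format actually does it), here is
a minimal sketch of a grid-based spatial index in plain Python. The cell size and function names are arbitrary
choices of mine. The point is that a bounding box query only visits the grid cells that overlap the box,
while the unindexed version has to test every point.

```python
from collections import defaultdict

CELL_SIZE = 1.0  # grid cell size in the data's units (an arbitrary choice here)


def cell_of(x, y):
    """Return the (column, row) of the grid cell containing a point."""
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))


def build_index(points):
    """Map each grid cell to the ids of the points that fall inside it."""
    index = defaultdict(list)
    for point_id, (x, y) in points.items():
        index[cell_of(x, y)].append(point_id)
    return index


def query(index, points, min_x, min_y, max_x, max_y):
    """Return ids of points inside the box, visiting only overlapping cells."""
    min_col, min_row = cell_of(min_x, min_y)
    max_col, max_row = cell_of(max_x, max_y)
    hits = []
    for col in range(min_col, max_col + 1):
        for row in range(min_row, max_row + 1):
            for point_id in index.get((col, row), []):
                x, y = points[point_id]
                if min_x <= x <= max_x and min_y <= y <= max_y:
                    hits.append(point_id)
    return sorted(hits)


def full_scan(points, min_x, min_y, max_x, max_y):
    """The unindexed alternative: test every point in the dataset."""
    return sorted(
        point_id
        for point_id, (x, y) in points.items()
        if min_x <= x <= max_x and min_y <= y <= max_y
    )
```

Both functions return the same answer; the difference is how much of the dataset they have to touch to get
there. Real formats tend to use R-trees or similar structures rather than a fixed grid.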

After one or more indexes, the main thing I look for in a format is a binary encoding that is easy to read
(and write). GeoJSON and GML are both problematic here as they are text formats (which is great in a transfer
format), so for every coordinate of every spatial object the computer has to read in a series of digits
(and punctuation) and convert that into an actual binary number that it can understand. This is a slow
operation (by computer speeds anyway) and if I have a couple of million points in my coastline file then I
don't want to do 4 million slow operations before I even think of drawing something.
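
The difference between the two kinds of encoding can be sketched with the standard library alone. The
coordinates below are made up, and the layout (little-endian IEEE 754 doubles, x then y) is an assumption for
the example rather than any specific file format.

```python
import json
import struct

# 1,000 made-up lon/lat pairs.
coords = [(-4.25 + i * 1e-4, 55.86 + i * 1e-4) for i in range(1000)]

# Text encoding: every number is written as digits and must be parsed back.
text_blob = json.dumps(coords)

# Binary encoding: each value is 8 raw bytes, stored x, y, x, y, ...
flat = [value for pair in coords for value in pair]
binary_blob = struct.pack(f"<{len(flat)}d", *flat)


def read_text(blob):
    """Parse the digits back into floats, pair by pair."""
    return [tuple(pair) for pair in json.loads(blob)]


def read_binary(blob):
    """Reinterpret the bytes as doubles; no digit parsing is involved."""
    values = struct.unpack(f"<{len(blob) // 8}d", blob)
    return list(zip(values[0::2], values[1::2]))
```

Both readers give back the same coordinates. Wrapping each call in `timeit` on your own data is the honest way
to see how much the parsing step costs you.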

Three: I have to interact with users on a fairly regular basis and in a lot of cases these are not spatial
data experts. If a format comes with up to a dozen similarly named files (that are all important) that a GIS
will refuse to process unless you guess which is the important one, then it is more of a pain than a help. And
yes, shapefile, I'm looking at you. If your process still makes use of shapefiles please, please stop doing
that to your users (and the support team) and switch over to GeoPackages, which can store hundreds of data sets
inside a single file. All good GIS products can process them by now; they have been an OGC standard for nearly
10 years. If you don't think that shapefiles are confusing, go and ask your support team how often they have
been sent just the `.shp` file (or 11 files but not the `.sbn`) or how often they have seen people who have
deleted all the non-`.shp` files to save disk space.
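
If you are stuck supporting shapefiles, a small pre-flight check can catch the "just the `.shp`" case before
it reaches a GIS. This is a minimal sketch: it only looks for the three parts that are mandatory in a
shapefile (`.shp`, `.shx` and `.dbf`), it assumes lower-case extensions, and the function name is mine.

```python
from pathlib import Path

# The three files every shapefile needs: geometry, geometry index, attributes.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_parts(shp_path):
    """Return the mandatory extensions that are absent next to the given .shp path."""
    base = Path(shp_path)
    return [ext for ext in REQUIRED_EXTENSIONS if not base.with_suffix(ext).exists()]
```

An empty list means the mandatory parts are all present; it says nothing about optional files such as `.prj`,
which you almost certainly want as well.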

My other objection to GeoJSON is that I don't know what the structure (or schema) of the data set is until I
have read the entire file. That last record could add several bonus attributes; in fact any (or all) of the
records could do that, so from a parser's view it is a nightmare. At least GML provides me with a fixed schema
and enforces it throughout the file.
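
A minimal sketch of what that means in practice: the only way to learn the full set of attributes in a
GeoJSON `FeatureCollection` is to walk every feature and take the union of their `properties`. The function
name and the choice to record Python type names are mine, purely for illustration.

```python
import json


def collect_schema(geojson_text):
    """Return {attribute name: set of value type names} for a FeatureCollection."""
    collection = json.loads(geojson_text)
    schema = {}
    for feature in collection["features"]:
        # "properties" may legitimately be null, hence the fallback to {}.
        for name, value in (feature.get("properties") or {}).items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema
```

Note that the loop cannot stop early: until the last feature has been read there is no way to know whether
another attribute (or another type for an existing one) is about to turn up.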

When I'm storing data (as opposed to transferring it) I use PostGIS. It's fast and accurate, can store my data
in whatever projection I choose and is capable of interfacing with any GIS program I am likely to use. If
I'm writing new code then it provides good, well tested libraries in all the languages I care about, so I don't
have to get into the weeds of parsing binary formats. If I fetch a feature from PostGIS it will have exactly
the attributes I was expecting, no more and no less. It has good indexes and a nifty DSL (SQL) that I can use
to express my queries, which get dealt with by a cool query optimiser that knows way more than I do about how
to access data in the database.

If for some reason I need to access my data while I'm travelling or share it with a colleague then I will use
a GeoPackage, which is a neat little database all packaged up in a single file. It's not as quick as PostGIS so
I wouldn't use it for millions of records, but for most day to day GIS data sets it's great. You can even store
your QGIS styles and project in it to make it a single file project transfer format.
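
Because a GeoPackage is an SQLite database underneath, you can peek inside one with nothing more than Python's
standard library. This minimal sketch lists the layers recorded in the `gpkg_contents` table that the
GeoPackage standard requires; the function name is mine, and reading the actual geometries would need a proper
library such as GDAL.

```python
import sqlite3


def list_layers(gpkg_path):
    """Return (table_name, data_type) for each entry in gpkg_contents."""
    connection = sqlite3.connect(gpkg_path)
    try:
        rows = connection.execute(
            "SELECT table_name, data_type FROM gpkg_contents ORDER BY table_name"
        ).fetchall()
    finally:
        connection.close()
    return rows
```

The `data_type` column tells you whether an entry holds vector `features`, raster `tiles` or plain
`attributes`.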

One final point: I sometimes see people preaching that we should go cloud native (and often serverless) by
embracing "modern" standards like GeoJSON and COGs. GeoJSON should never be used as a cloud native storage
option (unless it's so small you can read it once and cache it in memory, in which case why are you using the
cloud?) as it is large (yes, I know it compresses well), slow to parse (and slower still if you compressed
it first) and can't be indexed. That means you have to copy the whole file from a disk on the far side of a
slow internet connection. I don't care if you have fibre to the door, it is still slow compared to the disk in
your machine!

![The Jack Sparrow worst pirate meme but for GeoJSON](/images/geojson.jpg )
108 changes: 108 additions & 0 deletions _posts/2024-07-16-spelling.md
@@ -0,0 +1,108 @@
---
layout: post
title: Adding a spell check to QGIS
date: 2024-07-16
categories: foss
---

# Adding a Spell Check to QGIS

(Or what to do on a rainy bank holiday in Glasgow)

This Monday was a local bank holiday in Glasgow (or at least at the university), a remnant of when the whole
town took a train to Blackpool in the same two weeks so that the shipbuilders and steelworks could stop in a
coordinated fashion. As is required in the UK the weather was awful, so I stayed in and, being bored, looked at
my long list of possible projects. I picked one that has been kicking around on the list for a while: adding a
spell checker to QGIS. As a dyslexic I have spell checking turned on in nearly every program I enter text
into, including `vim`, `IntelliJ` and my browser. So I have always felt that what QGIS really needed was a way
to spell check maps before I printed them at A3 and put them on the wall.

Back in 2019 North Road wrote a [blog post about custom layout checks](https://north-road.com/2019/01/14/on-custom-layout-checks-in-qgis-3-6-and-how-they-can-do-your-work-for-you/)
and ended it with a throwaway comment: "It’d even be possible to hook into one of the available Python spell
checking libraries to write a spelling check!". I came across this when I was trying to see if there was an
easy way for my students (many of whom have English as a second language) to avoid handing in projects with
glaring (i.e. I can see them) spelling errors in the title. So I stuck the link on my backlog until the
proverbial rainy day came along.

## Implementation

Obviously I'm the last person who should be allowed to write spell checking software, but the joy of open
source is that for things like this someone else has almost certainly already done it. So a quick DuckDuckGo
search found me installing `pyspellchecker`, which seemed like it would do what I want. It has a pretty easy
interface: once you've created a spell checker object, you can just pass in a list of words and it will return
a list of (probably) misspelled words. It also has a method to give the most likely correction and another
method to give you a list of other possibilities. Armed with this I could create a method to find and check
all the text elements of a print layout.

```py
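import string

from qgis.core import (
    QgsAbstractValidityCheck,
    QgsLayoutItemLabel,
    QgsValidityCheckResult,
    check,
)
from spellchecker import SpellChecker  # provided by the pyspellchecker package


@check.register(type=QgsAbstractValidityCheck.TypeLayoutCheck)
def layout_check_spelling(context, feedback):
    """Warn about possibly misspelled words in every label of a print layout."""
    layout = context.layout
    results = []
    checker = SpellChecker()

    for i in layout.items():
        if isinstance(i, QgsLayoutItemLabel):
            text = i.currentText()
            # Strip surrounding punctuation and drop tokens that end up empty.
            tokens = [word.strip(string.punctuation) for word in text.split()]
            tokens = [t for t in tokens if t]
            misspelled = checker.unknown(tokens)
            for word in misspelled:
                res = QgsValidityCheckResult()
                res.type = QgsValidityCheckResult.Warning
                res.title = 'Spelling Error?'
                template = f"""
                '<strong>{word}</strong>' may be misspelled, would
                '<strong>{checker.correction(word)}</strong>' be a better choice?
                """
                # candidates() may return None when nothing plausible is found.
                possibles = checker.candidates(word) or set()
                if len(possibles) > 1:
                    template += """
                    Or one of:<br/>
                    <ul>
                    """
                    for t in possibles:
                        template += f"<li>{t}</li>\n"
                    template += '</ul>'
                res.detailedDescription = template
                results.append(res)
    return results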
```
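
The tokenising step is the one part of the check that can be tried outside QGIS, so here is a minimal
standalone sketch of it. The `tokenize` helper is a name of my own, and dropping empty tokens is an addition
on my part: a lone dash or bullet becomes an empty string once its punctuation is stripped, and that would
otherwise be passed to the spell checker as a "word".

```python
import string


def tokenize(text):
    """Split label text on whitespace, strip surrounding punctuation, drop empties."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    return [word for word in stripped if word]


if __name__ == "__main__":
    print(tokenize("Glasgow, Scotland - a map (draft)!"))
```

Punctuation inside a word, such as the apostrophe in "don't", is left alone because `strip` only works on the
ends of each token.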

And in theory, that was that! But I'm pretty sure that my students (and everyone else) didn't want to
cut and paste that into the console every time they wanted to spell check a map. So I looked at how to
package this up for QGIS. I built a plugin (using the plugin builder tool), but then things got a little
tricky, as I can't see any way for a plugin to add itself to the print layout rather than the main QGIS window
(please let me know if it is possible), and it seemed unintuitive to make people press a button in one window
to affect another one. Besides, the whole point of being a `QgsAbstractValidityCheck` was that the method is
run automatically on print. So I didn't need most of the plugin code, or did I? On further thought I did: there
is a need for some GUI so the user can pick which language they want to use in the spell check.
`pyspellchecker` can spell check English, Spanish, French, Portuguese, German, Italian, Russian, Arabic,
Basque, Latvian and Dutch (so if one of those is your language then please test this for me). I also thought
that providing the option to supply a personal dictionary other than the default might be useful. So that made
use of the dialog that pops up when you click on the plugin.

But it turns out you can't register a class method as a `QgsAbstractValidityCheck` since it gets confused
when QGIS calls it later. So I had to move my checker method outside the plugin class. But then I couldn't
access the language and dictionary that were set in the GUI! Some more searching gave me the following code:

```py
from qgis.utils import plugins  # QGIS's dict of loaded plugin instances, keyed by plugin name

_instance = plugins['qgis-spellcheck']
checker = _instance.checker
```

With this I can pull out the named plugin and grab its spell checker, which was created in the plugin's
`__init__` method. I seem to have a small issue in that the user's profile is not set when that runs, which
messes up where the personal dictionary is put (again, if you know how to fix this let me know).


## Future Work

Ideally, I'd like the spell checker to scan and highlight the text in the boxes as I type, but I fear that is
beyond my understanding of the QGIS/Qt interface. Next highest on my wish list is for the list of spelling
issues to be non-modal so I can cut and paste fixes into the text box, rather than having to memorise the
correct spelling, close the window and then type it in (again, answers on a GitHub issue, please).

I'm sure all sorts of things will come up once people start using it, so as usual issues and PRs are welcome
at https://github.com/ianturton/qgis-spellcheck.

Binary file added images/geojson.jpg
