
Matching Google Camera RAW + JPEG #1

Open
ddbb07 opened this issue Aug 9, 2024 · 16 comments

Comments

@ddbb07

ddbb07 commented Aug 9, 2024

The current script doesn't work for Google Camera RAW + JPEG since it uses the following naming scheme
"PXL_20240808_195214323.RAW-01.COVER.jpg"
"PXL_20240808_195214323.RAW-02.ORIGINAL.dng"

These can be matched using only the suffixes "*-01.COVER.jpg" "*-02.ORIGINAL.dng"
I don't know how you would want this to be implemented since the current code only checks for file extensions.

@tenekev
Owner

tenekev commented Aug 9, 2024

I can implement an ignore_string argument. When you specify "-01.COVER" and "-02.ORIGINAL" it will filter them from the filenames before matching duplicates.

It's really easy to do, but I wonder if that's the best approach.
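
A minimal sketch of the idea, assuming a hypothetical ignore_string list (nothing like it exists in the current script): strip the configured substrings from each filename before comparing stems.

# Hypothetical ignore_string handling: strip the configured substrings
# from each filename before comparing stems for stacking.
IGNORE_STRINGS = ["-01.COVER", "-02.ORIGINAL"]

def normalized_stem(filename: str) -> str:
    for s in IGNORE_STRINGS:
        filename = filename.replace(s, "")
    return filename.rsplit(".", 1)[0]  # drop the extension as well

# Both sample files reduce to "PXL_20240808_195214323.RAW"
assert normalized_stem("PXL_20240808_195214323.RAW-01.COVER.jpg") == \
       normalized_stem("PXL_20240808_195214323.RAW-02.ORIGINAL.dng")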

@m3brown
Contributor

m3brown commented Aug 16, 2024

I started working on customizable criteria because I had a similar desire for more flexibility. I just opened #2, which is an approach to controlling the match logic with regex passed in as an env var.

I did some tweaking of my preferred regex to also suit your example strings:

r"([A-Z]+[-_]?[0-9]{4,}([-_][0-9]{4,})?)([\._-].*)?\.[\w]{3,4}$"

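As a quick sanity check (assuming the re.match + group(index) behaviour described in the README), the pattern yields the same group 1 for both sample filenames:

import re

PATTERN = r"([A-Z]+[-_]?[0-9]{4,}([-_][0-9]{4,})?)([\._-].*)?\.[\w]{3,4}$"

for name in (
    "PXL_20240808_195214323.RAW-01.COVER.jpg",
    "PXL_20240808_195214323.RAW-02.ORIGINAL.dng",
):
    # group(1) is the shared stem the two files would be stacked on
    print(re.match(PATTERN, name).group(1))  # PXL_20240808_195214323 for both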

I'm very open to feedback and ideas.

@m3brown
Contributor

m3brown commented Aug 17, 2024

@ddbb07 I believe the change above has been pushed to ghcr.io/tenekev/immich-auto-stack:latest, if you want to give it a whirl.

@PythonNut

I made an effort to design my own regex, but I ran into another issue. Apparently Pixels record the JPEG and the RAW with different localDateTime values. This causes matching to fail when localDateTime is used as a criterion, which seems important to keep for, e.g., Sony ARWs. Fortunately it's at least possible to work around this because:

  1. The Pixel does at least give the same base file name.
  2. The base name includes the localDateTime of the JPEG (in both cases) to millisecond precision.

Thus for Pixel files specifically, we can safely ignore the localDateTime and match based only on the file name. But of course, we need localDateTime for other files so there's no way to configure things correctly right now.

One way this could be addressed is to provide a list of possible criteria combinations; the program would step through the list and take the first combination that doesn't fail (i.e. all regexes match). Then we could write a very narrow regex for Pixel files and fall back to a relaxed regex + timestamp for other files.

Of course, creating a matching DSL of ever increasing complexity does seem like a slippery slope to madness so it's worth thinking about this carefully, I suppose.

@tenekev
Owner

tenekev commented Aug 22, 2024

It's a very slippery slope, indeed.

I believe you can still use a part of the localDateTime parameter. I included it initially because most cameras use a 4-digit sequence counter that rolls over at 9999 and resets back to 0. If you've shot 40k photos, you can potentially encounter the same filename 4 times. localDateTime ensures they aren't stacked together.

If you split, for example, "localDateTime": "2024-07-22T15:04:28.000Z" on the "T", you can use just the date portion. Here is an example:

[
  {
    "key": "originalFileName",
    "regex": {
      "key": "YOUR REGEX",
      "index": ?
    }
  },
  {
    "key": "localDateTime",
    "split": {
      "key": "T",
      "index": 0
    }
  }
]
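
In plain Python, that split criterion boils down to:

# Keep only the date portion of localDateTime.
"2024-07-22T15:04:28.000Z".split("T")[0]  # -> "2024-07-22"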

@tenekev
Owner

tenekev commented Aug 22, 2024

Maybe we can even default to the date portion because it's very unlikely for someone to generate 10000 photos in a single day. And I can't think of any other scenario where the time portion matters.

@m3brown
Contributor

m3brown commented Aug 22, 2024

This is fun! Thanks for sharing your scenario.

Thought 1: In my PR a few days ago, I added a flag to silently ignore files that do not match a regex, rather than halting the script. This would allow you to run the script twice with two different configs:
1. Regex whose prefix would only match pixel images (e.g. starts with PXL), with no date time check
2. Default (or similar) config to capture all the remaining files

It'd be cool to be able to define a config that could do the above in one pass (the fallback approach you suggested), but I think the two-pass approach is a reasonable workaround for now.

Thought 2: The value of the date check, in my opinion, is for file formats like IMG_1234.jpg that are common across cameras. I have probably 10 IMG_1234.jpg files from the various cameras and phones I've owned over the past 15 years.

The odds are very slim that more granularity is required than "date" (without time), so I support @tenekev's suggested default.

One situation where I could see this being a problem is if two family members bought new phones on the same day, which isn't terribly uncommon. In that case their photo IDs may occasionally sync up over time.

@PythonNut

I think it's potentially a bit risky to discard the time portion of the datetime. @m3brown suggested unwanted collisions could happen if the numbers ever sync up on the same day. Under certain circumstances, I think it could happen often enough to be a bit annoying, for example when multiple people go on a trip and bring cameras of the same brand. With a dedicated camera, it's quite possible to shoot hundreds of photos in a day (even thousands, although I've never hit 10k in a single day), so the chance of collision is not negligible (it happens to me every once in a while). For this reason, I'd be wary of using the date (only) as a default.

For now, I've settled on having multiple copies of the immich-auto-stack container running at the same time using SKIP_MATCH_MISS: True. It's not elegant, but it's effectively very similar to the fallback list I suggested and it does not require any manual intervention, which is nice.

@tenekev
Owner

tenekev commented Aug 22, 2024

One situation I could see this being a problem is if two family members bought new phones on the same day...

I have not thought of multi-user instances, but I guess we can solve this by adding ownerId as the primary criterion.

ownerId > originalFileName.split('.')[0] > localDateTime.split('T')[0]

This should ensure that assets from different users aren't mixed, although I find it highly unlikely that two users will generate the same filename on the same day.

Another approach is to use regex on the localDateTime string instead of a simple split method. If we can capture everything but the seconds, yyyy-MM-ddTHH:mm, the chances of two assets with the same name within a minute are extremely small.
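
A sketch of that regex idea (the pattern below is just one possible choice):

import re

# Capture everything up to the minutes, dropping seconds and milliseconds.
MINUTE_PRECISION = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2})"

m = re.match(MINUTE_PRECISION, "2024-07-22T15:04:28.000Z")
print(m.group(1))  # -> "2024-07-22T15:04"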

As for the SKIP_MATCH_MISS functionality, I think it needs some more explaining. I think it's a good idea but I wasn't entirely sure how to explain it in the readme.

We can probably eliminate the split method entirely and rely on regex for everything. originalFileName.split('.')[0] doesn't always work as intended - @ddbb07's sample filenames contain more than one dot, producing unexpected results that just happen to work. Relying on regex only will streamline the process.

It might be too ambitious, but building on that we can even have multiple sets of criteria for different filename schemas. Each schema will either match a given filename or fail silently and fall through to the next.

@PythonNut

I think multiple sets of criteria do make sense. It would break backward compatibility, but you could pull the structure into the YAML itself so it's not so hard to edit. E.g., if you wanted to merge photos from Sony cameras and Pixel phones separately, you could potentially write something like:

- CRITERIA: 
  - - key: originalFileName
      regex: "DSC..."
      index: 1
    - key: localDateTime

  - - key: originalFileName
      regex: "PXL..."
      index: 1

In this case, SKIP_MATCH_MISS would presumably come into play after all supplied patterns fail to match.
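
A rough sketch of how that fallback could work (the structure and names below are illustrative, not the current implementation): try each criteria set in order and use the first one whose regexes all match, with SKIP_MATCH_MISS only applying when every set fails.

import re

# Illustrative only: each criteria set is a list of (asset key, optional regex).
CRITERIA_SETS = [
    [("originalFileName", r"(DSC[-_]?\d{4,})"), ("localDateTime", None)],
    [("originalFileName", r"(PXL_\d+_\d+)")],
]

def stack_key(asset: dict):
    """Return a grouping key from the first criteria set that fully matches."""
    for criteria in CRITERIA_SETS:
        values = []
        for key, pattern in criteria:
            value = asset.get(key, "")
            if pattern is not None:
                m = re.match(pattern, value)
                if not m:
                    break  # this set fails, fall through to the next one
                value = m.group(1)
            values.append(value)
        else:
            return tuple(values)  # every criterion in this set matched
    return None  # nothing matched -> SKIP_MATCH_MISS territory

print(stack_key({"originalFileName": "PXL_20240808_195214323.RAW-01.COVER.jpg",
                 "localDateTime": "2024-08-08T19:52:14.323Z"}))
# -> ('PXL_20240808_195214323',)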

@tenekev
Owner

tenekev commented Sep 5, 2024

I really wanted to contain the whole service and its config in a single docker-compose block, and I really don't want to move away from this idea. I've always preferred portable services.

Meeting you halfway, we can do something like:

environment:
  CRITERIA_0_key: originalFileName
  CRITERIA_0_regex: "DSC..."
  CRITERIA_0_index: 1
  CRITERIA_1_key: originalFileName
  CRITERIA_1_regex: "PXL..."
  CRITERIA_1_index: 1

@PythonNut

Ah right, I forgot we have to pack it all into environment variables at the end of the day, which your format addresses of course. The one missing thing is that it only has one dimension of indexing. I'm not sure if the index you wrote is enumerating separate criteria or the fields of a single criterion, but I think both levels are important?

@m3brown
Contributor

m3brown commented Sep 5, 2024

While dynamic environment variables could work, it sounds messy. Something like a while loop that expects some or all of key, regex, and index to be defined. It'd be up to the user to ensure the numbers are sequential (i.e. we didn't skip from 2 to 4, missing 3).

The current criteria parser expects json as a single environment variable, and docker-compose supports multi-line json. I imagine a rejiggering of the expected CRITERIA json schema would be equivalent to the yaml proposal.

environment:
  CRITERIA: > 
    [
      [
        {
          "key": "originalFileName",
          "regex": "DSC...",
          "index": 1
        },
        {
          "key": "localDateTime"
        }
      ],
      [
        {
          "key": "originalFileName",
          "regex": "PXL...",
          "index": 1
        }
      ]
    ]

The nested structure gets a little gnarly to navigate. Nested arrays seem a little intense when most people will only have one criteria block. But I think it is an accurate description!

Another idea to make the structure a little flatter is to define the priority of each block. Below, there are two entries with priority 1, so they join to make a set.

environment:
  CRITERIA: > 
    [
      {
        "key": "originalFileName",
        "regex": "DSC...",
        "regex_index": 1,
        "priority": 1
      },
      {
        "key": "localDateTime",
        "priority": 1,
      },
      {
        "key": "originalFileName",
        "regex": "PXL...",
        "regex_index": 1,
        "priority": 2
      }
    ]

I'm sure we can come up with more ideas for the ideal schema. My main point is that I think json could work well for this need.
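
For what it's worth, regrouping the flatter priority-based form back into ordered sets is only a few lines (a sketch; the priority field comes from the proposal above, not from the current parser):

import json, os
from itertools import groupby

# Parse the single CRITERIA env var and regroup the flat list by "priority".
criteria = json.loads(os.environ["CRITERIA"])
criteria.sort(key=lambda c: c["priority"])
criteria_sets = [list(g) for _, g in groupby(criteria, key=lambda c: c["priority"])]
# criteria_sets[0] -> the two priority-1 entries, criteria_sets[1] -> the PXL entry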

@tenekev
Owner

tenekev commented Sep 9, 2024

While dynamic environment variables could work, it sounds messy.

I thought so too, but once you parse the variables with the CRITERIA_ prefix into a json structure, you can validate every object against 2-3 rules and spit out any inconsistencies or ignore them altogether. They don't even have to be sequentially indexed. As long as x and y in CRITERIA_x_y_key are integers, they are ordered.

I even wrote a quick and dirty PoC but I've been too busy to do anything this week.

import os, re, json
from jsonschema import validate, ValidationError

criteria_schema = {
  "type": "object",
  "properties": {
    "key": {"type": "string"},
    "regex": {"type": "string"},
    "index": {"type": "integer"}
  },
  "required": ["key"],
  "dependencies": {
    "regex": ["index"]
  }
}

def extract_criteria():
  criteria_pattern = re.compile(r'^CRITERIA_(\d{1,3})_(\d{1,3})_(\w+)$')
  criteria_dict = {}

  for env_var, value in os.environ.items():
    match = criteria_pattern.match(env_var)
    
    if match:
      priority, index, field = match.groups()

      if priority not in criteria_dict:
        criteria_dict[priority] = {}
      
      if index not in criteria_dict[priority]:
        criteria_dict[priority][index] = {}
      
      criteria_dict[priority][index][field] = value

  return criteria_dict

def validate_criteria(criteria_dict):
  valid_criteria = [None for x in range(100)]
  
  for priority_index, priority in criteria_dict.items():
    valid_criteria[int(priority_index)] = []
  
    for criteria_index, criteria in priority.items():
      
      try:
        criteria['index'] = int(criteria['index'])
      
      except (ValueError, KeyError):
        print(f"Invalid index for CRITERIA_{priority_index}_{criteria_index}: {criteria.get('index')}")
        continue

      try:
        validate(instance=criteria, schema=criteria_schema)
        valid_criteria[int(priority_index)].append(criteria)
      
      except ValidationError as e:
        print(f"Validation error for CRITERIA_{priority_index}_{criteria_index}: {e.message}")
    
    if valid_criteria[int(priority_index)] == [] :
      valid_criteria[int(priority_index)] = None

  valid_criteria = [x for x in valid_criteria if x is not None]
  
  return valid_criteria

if __name__ == "__main__":
  
  os.environ["CRITERIA_0_0_key"] = "originalFileName"
  os.environ["CRITERIA_0_0_regex"] = "DSC.*"
  os.environ["CRITERIA_0_0_index"] = "1"
  os.environ["CRITERIA_0_3_key"] = "localDateTime"
  os.environ["CRITERIA_0_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"
  os.environ["CRITERIA_0_3_index"] = "1"
  os.environ["CRITERIA_1_0_key"] = "originalFilerName"
  os.environ["CRITERIA_1_0_regex"] = "PXrL.*"
  os.environ["CRITERIA_1_0_index"] = "0"

  # This criteria is missing a required key
  os.environ["CRITERIA_5_0_regex"] = "PXrL.*"
  os.environ["CRITERIA_5_0_index"] = "0"
  
  # This criteria is missing a required index for its RegEx
  os.environ["CRITERIA_17_3_key"] = "localDateTime"
  os.environ["CRITERIA_17_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"

  # -----

  extracted_criteria = extract_criteria()
  valid_criteria = validate_criteria(extracted_criteria)

  print(json.dumps(valid_criteria, indent=2))

Returns this:

Validation error for CRITERIA_5_0: 'key' is a required property
Invalid index for CRITERIA_17_3: None
[
 // Priority 0
  [ 
    // Criteria 0
    {
      "key": "originalFileName",
      "regex": "DSC.*",
      "index": 1
    },
    // Criteria 1 (but was named 3 in the env var)
    {
      "key": "localDateTime",
      "regex": "(\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d\\:\\d\\d)\\:\\d\\d\\..+",
      "index": 1
    }
  ],
  // Priority 1
  [
    // Criteria 0
    {
      "key": "originalFilerName",
      "regex": "PXrL.*",
      "index": 0
    }
  ]
]

This is how priority can be denoted:

environment:
  CRITERIA_0_0_key: originalFileName
#     ^    ^ ^  ^          ^
#     |    | |  |          Value
#     |    | |  Key 
#     |    | Criteria Index
#     |    Priority Index
#     Keyword that triggers the parser

Does this get way too complicated? The main reason behind this effort is my concern about mixing yaml and json with multiline strings, which I've always found a bit crummy.

@PythonNut

So I was thinking about this a bit more and I'm wondering if the so-called "criteria index" is really necessary. Assuming users don't want to match different keys to each other or use the same key multiple times, we're free to use any arbitrary order for the fields of the criteria. In that case the order could just be lexicographic with respect to the key names. That could make the configuration format simpler, both to read and write.

Also, I wonder if there's some interest in shipping with some pre-written rules to match various camera brands? With criterion priority, we could write some very precise rules for each brand and be confident that there wouldn't be any conflicts. I'm happy to give rules for Pixels and Sony cameras and I assume we could eventually collect rules for all of the major brands. If we do get to that point, then I imagine most users wouldn't need to bother editing the config at all. They could just grab the rules for the cameras they own.

@m3brown
Contributor

m3brown commented Sep 20, 2024

Firstly, I like the idea of having recommended logic for common camera brands. Whether it's codified or just in the readme, it'd be helpful for the community using the library. I think we'd need to rely on community pull requests for folks to add configs once we set up the framework (in code or in docs).

Secondly, I'm a little perplexed by a dual meaning for criteria index.

  • I implemented the regex functionality currently in place, where "index will select a substring using re.match.group(index)" (source: README).
    • it's possible that 100% of people want index 1 and this can be removed.
  • In the thread above, index is used for ordering. For this scenario, the alphabetical workaround is reasonable, as is the json approach (where lists are inherently ordered).

We'd need to be thoughtful of naming conventions if we have both criteria order index and regex match index.
