-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matching Google Camera RAW + JPEG #1
Comments
I can implement an ignore_string argument. When you specify "-01.COVER" and "-02.ORIGINAL" it will filter them from the filenames before matching duplicates. It's really easy the do but I wonder if that's the best approach. |
I started working on customizable criteria because I had a similar desire for more flexibility. I just opened #2, which is an approach to controlling the match logic with regex passed in as an env var. I did some tweaking of my preferred regex to also suit your example strings:
I'm very open to feedback and ideas. |
@ddbb07 I believe the change above has been pushed to ghcr.io/tenekev/immich-auto-stack:latest, if you want to give it a whirl. |
I made an effort to design my own regex, but I ran into another issue. Apparently Pixels record JPEG and RAW with different
Thus for Pixel files specifically, we can safely ignore the One way this could be addressed is to provide a list of possible criteria combinations and the program steps through the list and takes the first combination that doesn't fail (i.e. all regexes match). Then we could write a very narrow regex for Pixel files and fall back to a relaxed regex + timestamp for other files. Of course, creating a matching DSL of ever increasing complexity does seem like a slippery slope to madness so it's worth thinking about this carefully, I suppose. |
It's a very slippery slope, indeed. I believe you can still use a part of the If you split for example [
{
"key": "originalFileName",
"regex": {
"key": "YOUR REGEX",
"index": ?
}
},
{
"key": "localDateTime",
"split": {
"key": "T",
"index": 0
}
}
] |
Maybe we can even default to the date portion because it's very unlikely for someone to generate 10000 photos in a single day. And I can't think of any other scenario where the time portion matters. |
This is fun! Thanks for sharing your scenario. Thought 1: In my PR a few days ago, I added a flag to silently ignore files that do not match a regex, rather than halting the script. This would allow you to run the script twice with two different configs: it'd be cool to be able to define a config that could do the above in one pass (the fallback approach you suggested) but I think the two pass approach is a reasonable workaround for now. Thought 2: The value of the date check, in my opinion, is to for file formats like IMG_1234.jpg that are common across cameras. I have probably 10 IMG_1234.jpg files for the various cameras and phones I've owned over the past 15 years. The odds are very slim that more granularity is required than "date" (without time), so I support @tenekev's suggested default. One situation I could see this being a problem is if two family members bought new phones on the same day, which isn't terribly uncommon. In this case their photo ids may sync up occasionally over time. |
I think it's potentially a bit risky to discard the time portion of the datetime. @m3brown suggested unwanted collisions could happen if the numbers ever sync up on the same day. Under certain circumstances, I think it could happen often enough to be a bit annoying. For example, if multiple people go on a trip and bring cameras with the same brand. With a dedicated camera, it's quite possible to shoot hundreds of photos in a day (even thousands, although I've never hit 10k in a single day) so the chance of collision is not negligible (it happens to me every once in a while). For this reason, I'd be wary of using the date (only) as a default. For now, I've settled on having multiple copies of the |
I have not thought of multi-user instances but I guess we can solve this by adding
This should ensure that assets from different users aren't mixed although I find it highly unlikely that two users will generate the same filename in the same day. Another approach is to use regex on the As for the We can probably eliminate the split method entirely and rely on regex for everything. It might be too ambitious but building onto that we can even have multiple sets of criteria for different filename schemas. Every schema will match with a certain filename or fail silently to the next. |
I think multiple sets of criteria do make sense. It would break backward compatibility, but you could pull the structure into the YAML itself so it's not so hard to edit. E.g., if you wanted to merge photos from Sony cameras and Pixel phones separately, you could potentially write something like:
In this case, |
I really wanted to contain the whole service and its config in a single docker-compose block. And I really don't want to move from this idea. I've always preferred portable service. Meeting you halfway, we can do something like: environment:
CRITERIA_0_key: originalFileName
CRITERIA_0_regex: "DSC..."
CRITERIA_0_index: 1
CRITERIA_1_key: originalFileName
CRITERIA_1_regex: "PXL..."
CRITERIA_1_index: 1 |
Ah right, I forgot we have to pack it all into environment variables at the end of the day, which your format addresses of course. The one missing thing is that it has only has one dimension of indexing. Not sure if the index you wrote is enumerating separate criteria or the fields of a single criteria, but I think both levels are important? |
While dynamic environment variables could work, it sounds messy. Something like a while loop that expects some or all of keys, regex, and index to be defined. It'd be up to the user to ensure the numbers are sequential (i.e. we didn't skip from 2 to 4, missing 3). The current criteria parser expects json as a single environment variable, and docker-compose supports multi-line json. I imagine a rejiggering of the expected CRITERIA json schema would be equivalent to the yaml proposal.
The nested structure gets a little gnarly to navigate. Nested arrays seems a little intense when most people will only have one criteria block. But I think it is an accurate description! Another idea to make the structure a little flatter is to define the priority of each block. Below, there are two priority 1 so they join to make a set.
I'm sure we can come up with more ideas of the ideal schema. My biggest point is I think json could work well for the need. |
I thought so too but once you parse the variables with I even wrote a quick and dirty PoC but I've been too busy to do anything this week. import os, re, json
from jsonschema import validate, ValidationError
criteria_schema = {
"type": "object",
"properties": {
"key": {"type": "string"},
"regex": {"type": "string"},
"index": {"type": "integer"}
},
"required": ["key"],
"dependencies": {
"regex": ["index"]
}
}
def extract_criteria():
criteria_pattern = re.compile(r'^CRITERIA_(\d{1,3})_(\d{1,3})_(\w+)$')
criteria_dict = {}
for env_var, value in os.environ.items():
match = criteria_pattern.match(env_var)
if match:
priority, index, field = match.groups()
if priority not in criteria_dict:
criteria_dict[priority] = {}
if index not in criteria_dict[priority]:
criteria_dict[priority][index] = {}
criteria_dict[priority][index][field] = value
return criteria_dict
def validate_criteria(criteria_dict):
valid_criteria = [None for x in range(100)]
for priority_index, priority in criteria_dict.items():
valid_criteria[int(priority_index)] = []
for criteria_index, criteria in priority.items():
try:
criteria['index'] = int(criteria['index'])
except (ValueError, KeyError):
print(f"Invalid index for CRITERIA_{priority_index}_{criteria_index}: {criteria.get('index')}")
continue
try:
validate(instance=criteria, schema=criteria_schema)
valid_criteria[int(priority_index)].append(criteria)
except ValidationError as e:
print(f"Validation error for CRITERIA_{priority_index}_{criteria_index}: {e.message}")
if valid_criteria[int(priority_index)] == [] :
valid_criteria[int(priority_index)] = None
valid_criteria = [x for x in valid_criteria if x is not None]
return valid_criteria
if __name__ == "__main__":
os.environ["CRITERIA_0_0_key"] = "originalFileName"
os.environ["CRITERIA_0_0_regex"] = "DSC.*"
os.environ["CRITERIA_0_0_index"] = "1"
os.environ["CRITERIA_0_3_key"] = "localDateTime"
os.environ["CRITERIA_0_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"
os.environ["CRITERIA_0_3_index"] = "1"
os.environ["CRITERIA_1_0_key"] = "originalFilerName"
os.environ["CRITERIA_1_0_regex"] = "PXrL.*"
os.environ["CRITERIA_1_0_index"] = "0"
# This criteria is missing a required key
os.environ["CRITERIA_5_0_regex"] = "PXrL.*"
os.environ["CRITERIA_5_0_index"] = "0"
# This criteria is missing a required index for its RegEx
os.environ["CRITERIA_17_3_key"] = "localDateTime"
os.environ["CRITERIA_17_3_regex"] = "(\d\d\d\d-\d\d-\d\dT\d\d\:\d\d)\:\d\d\..+"
# -----
extracted_criteria = extract_criteria()
valid_criteria = validate_criteria(extracted_criteria)
print(json.dumps(valid_criteria, indent=2)) Returns this: Validation error for CRITERIA_5_0: 'key' is a required property
Invalid index for CRITERIA_17_3: None
[
// Priority 0
[
// Criteria 0
{
"key": "originalFileName",
"regex": "DSC.*",
"index": 1
},
// Criteria 1 (but was named 3 in the env var)
{
"key": "localDateTime",
"regex": "(\\d\\d\\d\\d-\\d\\d-\\d\\dT\\d\\d\\:\\d\\d)\\:\\d\\d\\..+",
"index": 1
}
],
// Priority 1
[
// Criteria 0
{
"key": "originalFilerName",
"regex": "PXrL.*",
"index": 0
}
]
] This is how priority can be denoted: environment:
CRITERIA_0_0_key: originalFileName
# ^ ^ ^ ^ ^
# | | | | Value
# | | | Key
# | | Criteria Index
# | Priority Index
# Keyword that triggers the parser Does this get way too complicated? The main reason behind this effort is my concern about mixing yaml and json with multiline which I've always found a bit crummy. |
So I was thinking about this a bit more and I'm wondering if the so called "criteria index" is really necessary? Assuming users don't want to match different keys to each other or use the same key multiple times, we're free to use any arbitrary order for the fields of the criteria. In that case the order could just be lexicographic wrt the key names. That could make the configuration format simpler, both to read and write. Also, I wonder if there's some interest in shipping with some pre-written rules to match various camera brands? With criterion priority, we could write some very precise rules for each brand and be confident that there wouldn't be any conflicts. I'm happy to give rules for Pixels and Sony cameras and I assume we could eventually collect rules for all of the major brands. If we do get to that point, then I imagine most users wouldn't need to bother editing the config at all. They could just grab the rules for the cameras they own. |
Firstly, I like the idea of having recommended logic for common camera brands. Whether it's codified or just in the readme, it'd be helpful for the community using the library. I think we'd need to rely on community pull requests for folks to add configs once we set up the framework (in code or in docs). Secondly, I'm a little perplexed by a dual meaning for criteria index.
We'd need to be thoughtful of naming conventions if we have both criteria order index and regex match index. |
The current script doesn't work for Google Camera RAW + JPEG since it uses the following naming scheme
"PXL_20240808_195214323.RAW-01.COVER.jpg"
"PXL_20240808_195214323.RAW-02.ORIGINAL.dng"
These can be matched using only the suffixes "*-01.COVER.jpg" "*-02.ORIGINAL.dng"
I don't know how you would want this to be implemented since the current code only checks for file exensions.
The text was updated successfully, but these errors were encountered: