
Stop missing event checker from checking entire range every time#221

Open
DobbyAHouseElf wants to merge 8 commits into ep1cman:main from DobbyAHouseElf:improve-missing-event-checker

Conversation

@DobbyAHouseElf
Contributor

Reduce the time range used by the missing events checker so it stops checking the entire range every time.
Once the missing events checker runs, the start of the next time range will be the end of the current one. The start can be pulled back earlier for reasons such as ongoing events and failed uploads.

I have tested failing uploads by throwing an error and checking that the start time moves earlier for the next missing events check.
I haven't tested ongoing events, as it is hard to time an active event to trigger while the missing events checker is running.

```python
start_time = self.start_time
end_time = datetime.now(timezone.utc)
# Set next start time to be the end of the times checked for this iteration
self.start_time = end_time
```
Owner

This should only be updated at the end

Contributor Author

I noticed an issue with this. If self.start_time is not updated straight away, the update_start_time method won't work correctly, as it checks whether the new time is earlier than the current self.start_time. Also, setting it at the end of this method could overwrite a self.start_time that a failed request has already pulled earlier.

There is currently no try in this method, but it could potentially call update_start_time(old_start_time) on failure.

Contributor Author

This is potentially a solution for when the missing event checker fails.
22fc8c8

Comment thread unifi_protect_backup/missing_event_checker.py Outdated
Comment thread unifi_protect_backup/unifi_protect_backup_core.py Outdated
@ep1cman
Owner

ep1cman commented Jul 28, 2025

Apologies for the incoming essay; this was mostly me reasoning out what I expect to happen versus what is currently happening. If you want me to take over implementing the below, I am happy to do so!

Context

The way UFP's downloader works is that it retries N times in quick succession to download an event if it fails, in case it was an intermittent error (e.g. a blip in network connectivity). Critically, the downloader is not responsible for re-adding failed events to the download queue; that is the job of the missing event checker. This effectively allows the user to tune the retry interval by adjusting how frequently the checker runs.

Equally, the uploader has no retry logic at all (we rely on rclone to handle rapid retries) and depends entirely on the missing event checker to reschedule a download, and thus an upload.

The downloader keeps track of how many times it has attempted to download each clip; after M failed attempts (so N*M download attempts in total) it will finally ignore the event entirely.
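The two-level retry described above can be sketched like this. The names, counts, and module-level state are illustrative only; the real downloader is structured differently.

```python
# Sketch of the two-level retry: N quick retries per pass, M passes total.
QUICK_RETRIES = 3   # "N": rapid retries within one downloader pass
MAX_PASSES = 5      # "M": failed passes before the event is ignored for good

long_term_attempts = {}   # event_id -> number of failed passes
ignored_events = set()    # events we have given up on permanently


def download_pass(download, event_id):
    """One downloader pass: N quick retries, then give up and leave
    rescheduling to the missing event checker."""
    for attempt in range(QUICK_RETRIES):
        try:
            return download(event_id)
        except Exception:
            if attempt == QUICK_RETRIES - 1:
                raise


def handle_event(download, event_id):
    if event_id in ignored_events:
        return None
    try:
        return download_pass(download, event_id)
    except Exception:
        count = long_term_attempts.get(event_id, 0) + 1
        long_term_attempts[event_id] = count
        if count >= MAX_PASSES:
            ignored_events.add(event_id)  # N*M raw attempts exhausted
        return None
```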

In the process of explaining this I have realised that if the downloads succeed but the uploads fail, the application will currently keep on trying to download and upload the clip. This is the cause of #202!

With this PR as is, both the downloader and uploader will cease to perform long term retries. I think a more fully rounded solution to managing failures is required.

New approach

In principle I like the approach taken here. Reducing the window of time we need to ask Protect for events will limit the expensive operations required, and thus reduce the burden this application places on Protect (which is not insignificant).

With these changes the missing event checker checks blocks of time between when it ran last and the most recent time for which all events have concluded. In these blocks of time we should end up with an authoritative list of all events that exist in Protect.

However, one concern I have is whether events can pop into existence retroactively after we have checked. For example, are AI detections always processed in real time, with an event starting, being in progress as it occurs, and then ending? Or does Protect go back in time and create new events retroactively (especially considering AI detections can now be done externally on the AI Key)? I have a vague memory of this being the case for smart detections vs motion detections back when I started this project (motion detections were captured in real time and smart detections were added afterwards), but I might be misremembering.

For the above reason I suggest we add a reasonable buffer period to the time range we request from Protect, e.g. 3 hours. This should be more than enough time for detections to be completed, while still being significantly less expensive than asking for the whole retention/missing period.
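A buffered query window could look like the sketch below. The 3-hour figure is just the example from this comment, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Illustrative buffer; enough time for retroactive detections to land.
BUFFER = timedelta(hours=3)


def next_query_window(last_checked: datetime):
    """Range to request from Protect: from the last checked time minus a
    buffer (to catch retroactively created events) up to now."""
    end_time = datetime.now(timezone.utc)
    start_time = min(last_checked - BUFFER, end_time)
    return start_time, end_time
```

Because consecutive windows overlap by the buffer, an event created retroactively within the buffer period is still seen by the next pass.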

This tool's intention is to take a relatively paranoid approach of trying to ensure, to the extent reasonably possible, that ALL clips are backed up.

If we are no longer querying the whole retention period, we should instead keep a set of events that were previously missing. We could have the missing event tracker keep track of:

  • Which events are missing from the requested retention period
  • How many times we have added the event to the download queue
    • This will capture how many download and/or upload attempts have failed.

By taking this approach we no longer need to plumb the uploader or downloader into the missing event checker, and we centralise the long-term retry logic. We can strip the downloader's long-term retry logic out, moving it to the missing event checker instead.

On each execution we need to:

  • Check the database for which events have been backed up (or ignored by adding them to the events table with no corresponding entries in the backups table)
  • Check if any of our previously missing events have been backed up; if so, remove them from our set
  • Ask Protect for a list of events from our last known good time minus some buffer period up to the current time
  • Check if any of these events are missing from the database; if so, add them to the set of missing events we are tracking
  • Check if any of these events are in the download/upload queues; if so, do nothing for now
  • For each remaining event:
    • Check how many times it has been re-attempted; if too many, ignore it permanently
    • Add the event to the download queue and increment its attempt count

@DobbyAHouseElf
Contributor Author

I have had it running since the last commit I made on this pull request. I recently restarted the application after making sure there were no active events, and on restart no missed events were picked up from the full range. Looking at the logs, I think this may have just been lucky.

I have multiple events which have failed to download 5 times. They were then picked up by the missing event checker, but only because there were other active events older than the failed ones. I don't know if the active events were the cause of the failed downloads or if it was coincidence that there were active events when a failure occurred (running on a Cloud Key). At a minimum, this pull request needs an else here which pulls back the start_time for the next missing event check.

For smart events vs motion events, they have all been pulling the start_time back for the next missing event check. I don't have any AI events active, so I don't know how they behave. I agree with adding a buffer period, which could be longer than 3 hours; as long as there are fewer than 500 events in that time it will still be a single request, so hopefully no added load.

I don't fully understand the approach you suggested near the end. Is this another table which will keep track of all events found by the missing event checker? Would this mean no longer pulling the start_time back for failed events, and instead pulling the events from the database and re-queuing them on failure? It could also have an earliest retry time with an exponential backoff so the downloader and uploader don't keep retrying too often.
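The exponential backoff idea could look like this. It is purely illustrative, not part of the PR as written; the base delay and cap are made-up values.

```python
from datetime import datetime, timedelta, timezone


def next_retry_time(attempts,
                    base=timedelta(minutes=5),
                    cap=timedelta(hours=6)):
    """Earliest time an event should be retried after `attempts` failures:
    base * 2**attempts, capped so retries never stop entirely."""
    delay = min(base * (2 ** attempts), cap)
    return datetime.now(timezone.utc) + delay
```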

I am happy for you to edit the code in this pull request or another one if needed. I might get a better understanding of it by looking at the code.

@ep1cman
Owner

ep1cman commented Jul 28, 2025

The load imposed on Protect mostly comes from it having to perform database operations internally rather than from the number of requests made. A single request over a long period takes noticeably longer than several requests over small periods.

I wasn't suggesting a new table in the database but rather just a Set of known missing event IDs stored in the missing event checker.

At start, it will fetch all events in the retention period and compare them against what is in the database. Any missing events are added to the Set.
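The startup comparison is just a set difference; a hypothetical helper (the name is mine, not the project's):

```python
def initial_missing_set(protect_event_ids, db_event_ids):
    """Startup pass (sketch): any event Protect reports within the retention
    period that has no database entry is initially missing."""
    return set(protect_event_ids) - set(db_event_ids)
```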

The time-based logic can remain exactly as it is for checking future chunks of time for missing events (with the addition of a buffer period), but we no longer need to update the time from within the downloader/uploader, since failed event IDs are now tracked by the missing event checker directly. We don't need the downloader/uploader to tell us they failed; we can infer this from the fact the ID doesn't appear in the database. This also removes the need for the data class that is being passed around.

In addition to checking the next chunk of time from Protect for missing events, we also need to check whether the events we know are missing have been added. If they haven't, and they aren't currently sitting in the download/upload queues, increment the attempt count and add them to the download queue (which will propagate to the upload queue). Once the attempt count exceeds the configured limit, permanently ignore the event (meaning we can remove this logic from the Downloader).

@ep1cman
Owner

ep1cman commented Jul 28, 2025

I will work on implementing this this evening and get your input

@DobbyAHouseElf
Contributor Author

I understand the approach now. That makes sense. Thanks for explaining.

@ep1cman ep1cman marked this pull request as draft July 29, 2025 11:11
@ep1cman ep1cman self-requested a review July 29, 2025 11:11
@ep1cman ep1cman marked this pull request as ready for review July 29, 2025 11:12