Stop missing event checker from checking entire range every time#221
DobbyAHouseElf wants to merge 8 commits into ep1cman:main from
Conversation
```python
start_time = self.start_time
end_time = datetime.now(timezone.utc)
# Set next start time to be the end of the times checked for this iteration
self.start_time = end_time
```
This should only be updated at the end
I noticed an issue with this. If self.start_time is not updated straight away, the update_start_time method won't work correctly, since it checks whether the new time is earlier than the current self.start_time. Also, setting it at the end of this method could overwrite self.start_time after a failed request has pulled it earlier.
There is currently no try in this method, but it could potentially call update_start_time(old_start_time) on failure.
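To make the ordering concrete, here is a minimal sketch of the pattern being discussed: advance `self.start_time` immediately, but rewind it via `update_start_time` if the check fails. The class and method names are illustrative stand-ins, not the project's real API.

```python
from datetime import datetime, timezone


class MissingEventChecker:
    """Minimal sketch; method names are illustrative, not the project's real API."""

    def __init__(self, start_time: datetime):
        self.start_time = start_time

    def update_start_time(self, new_time: datetime) -> None:
        # Only ever pull the window start backwards, never forwards.
        if new_time < self.start_time:
            self.start_time = new_time

    def check(self) -> None:
        old_start_time = self.start_time
        end_time = datetime.now(timezone.utc)
        # Advance immediately so concurrent update_start_time() calls
        # compare against the new window start...
        self.start_time = end_time
        try:
            self._check_events(old_start_time, end_time)
        except Exception:
            # ...but rewind on failure so the same range is re-checked next run.
            self.update_start_time(old_start_time)

    def _check_events(self, start: datetime, end: datetime) -> None:
        raise NotImplementedError  # stand-in for querying Protect
```

This keeps the "only move earlier" comparison in `update_start_time` valid at all times, while still guaranteeing a failed range is retried.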
This is potentially a solution if the missing event checker fails:
22fc8c8
Apologies about the incoming essay; this was mostly me reasoning out what I expect to happen versus what is currently happening. If you want me to take over implementing the below I am happy to do so!

**Context**

The way UFP's downloader works is that it retries N times in quick succession to download an event if it fails, in case it was an intermittent error (e.g. a blip in network connectivity). Critically, the downloader is not responsible for re-adding failed events to the download queue; this is the job of the missing event checker, effectively allowing the user to tune this interval by adjusting how frequently it runs. Equally, the uploader has no retry logic at all (we rely on rclone for handling rapid retries) and relies entirely on the missing event checker to reschedule a download, and thus an upload. The downloader keeps track of how many times it has attempted to download the clip; after M failed attempts (so N*M download attempts) it will finally ignore the event entirely.

In the process of explaining this I have realised that if the downloads succeed but the uploads fail, the application will currently keep on trying to download and upload a clip. This is the cause of #202! With this PR as is, both the downloader and uploader will cease to perform long term retries. I think a more fully rounded solution to managing failures is required.

**New approach**

In principle I like the approach taken here. Reducing the window of time we need to ask Protect for events will limit the expensive operations required, and thus reduce the burden this application creates on Protect (which is not insignificant). With these changes the missing event checker checks blocks of time between when it last ran and the most recent time for which all events have concluded. In these blocks of time we should end up with an authoritative list of all events that exist in Protect. However, one concern I have is whether events can pop into existence retroactively after we have checked.

For example, are AI detections always processed in real time, with an event starting, being in progress as it occurs, and then ending? Or does Protect go back in time and create new events retroactively (especially considering the AI detections can now be done externally on the AI Key)? I have a vague memory of this being the case for smart detections vs motion detections back when I started this project (motion detections were captured in real time and smart detections added after), but I might be misremembering.

For the above reason I suggest we add a reasonable buffer period to the time range we request from Protect, e.g. 3h. This should be more than enough time for detections to be completed, while still significantly less expensive than asking for the whole retention/missing period.

This tool's intention is to take a relatively paranoid approach of trying to ensure, to the extent reasonably possible, that ALL clips are backed up. If we are no longer querying the whole retention period, we should instead be able to keep a set of which events were previously missing. We could have the missing event tracker keep track of which events it has found to be missing, along with how many times each has been retried.

By taking this approach we no longer need to plumb the uploader or downloader into the missing event checker, and we centralise the long term retry logic. We can strip the downloader's long term retry logic out, moving it to the missing event checker instead. On each execution it would check the next block of time for missing events, re-queue any known-missing events that still haven't appeared, and permanently ignore events once their attempt count exceeds the configured limit.
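The suggested buffer period could be as simple as pulling the window start back by a fixed amount each run. This is a hedged sketch under stated assumptions; `EVENT_BUFFER` and `next_check_window` are hypothetical names, not existing code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical buffer so events Protect creates retroactively
# (e.g. smart detections added after the fact) are still caught.
EVENT_BUFFER = timedelta(hours=3)


def next_check_window(last_checked: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) range to request from Protect.

    The start is pulled back by EVENT_BUFFER so a later check still
    covers events that appeared retroactively inside an already-checked
    block of time.
    """
    end = datetime.now(timezone.utc)
    start = min(last_checked - EVENT_BUFFER, end)
    return start, end
```

Each run overlaps the previous one by the buffer, which is still far cheaper than re-querying the whole retention period.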
I have had it running since the last commit I did on this pull request. I recently restarted the application after making sure there were no active events and, on restart, there were no missed events picked up from the full range. Looking at the logs, I think this may have just been lucky. I have multiple events which have failed to download 5 times. They have then been picked up by the missing event checker, but only because there were other active events which were older than the failed events. I don't know if the active events were the cause of the failed downloads or if it was random that there were active events when there was a failure (running on a Cloud Key). This pull request does, at a minimum, need an else here which pulls back the start_time for the next missing event checker run.

For smart events vs motion events, they have all been pulling the start_time back for the next missing event checker. I don't have any AI events active, so I don't know how they behave.

I agree with adding a buffer period, which could be more than 3 hours; if there are fewer than 500 events in that time, it will still be a single request, so hopefully no added load.

I don't fully understand the approach you suggested near the end. Is this another table which will keep track of all events found by the missing event checker? Would this then mean no longer pulling the start_time back for failed events, and instead pulling the events from the database and putting them back on failure? It could also have an earliest retry time with an exponential backoff so the downloader and uploader don't keep retrying too often.

I am happy for you to edit the code in this pull request or another one if needed. I might get a better understanding of it by looking at the code.
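The exponential backoff idea mentioned above could look something like this. A minimal sketch, assuming a hypothetical `next_retry_time` helper with an invented base delay and cap; none of these values come from the project.

```python
from datetime import datetime, timedelta, timezone


def next_retry_time(attempt: int,
                    base: timedelta = timedelta(minutes=5),
                    cap: timedelta = timedelta(hours=6)) -> datetime:
    """Earliest time the missing event checker should retry an event.

    attempt 0 -> +5 min, 1 -> +10 min, 2 -> +20 min, ... capped at 6 h,
    so repeated failures stop hammering the downloader/uploader.
    """
    delay = min(base * (2 ** attempt), cap)
    return datetime.now(timezone.utc) + delay
```

The checker would skip any tracked event whose `next_retry_time` has not yet passed, rather than re-queueing it on every run.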
The load imposed on Protect mostly comes from it having to perform database operations internally, rather than from the number of requests made. A single request over a long period takes noticeably longer than several requests for small periods.

I wasn't suggesting a new table in the database, but rather just an in-memory collection held by the missing event checker.

At start, it will fetch all events in the retention period and compare them against what is in the database. Any missing events are added to the collection.

The time based logic can remain exactly as is for checking future chunks of time for missing events (with the addition of a buffer period), but we no longer need to update the time from within the downloader/uploader, since failed event IDs are now tracked by the missing event checker directly. We don't need the downloader/uploader to tell us they failed; we can infer this from the fact the ID doesn't appear in the database. This also removes the need for the data class that is being passed around.

In addition to checking the next chunk of time from Protect for missing events, we also need to check if the events we know are missing have been added. If they haven't, and aren't currently sitting in the download/upload queues, increment the attempt count and add them to the download queue (which will propagate to the upload queue). Once the attempt count exceeds the configured amount, permanently ignore the event (meaning we can remove this logic from the Downloader).
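The reconciliation described above can be sketched roughly as follows. This is not the project's real code: the class name, `max_attempts` default, and the idea of passing in plain ID sets are all assumptions made for illustration.

```python
class MissingEventTracker:
    """Sketch of the in-memory retry tracking described above.

    All names are illustrative; nothing here is the project's real API.
    """

    def __init__(self, max_attempts: int = 5):
        self.max_attempts = max_attempts
        self.attempts: dict[str, int] = {}  # event_id -> retry count
        self.ignored: set[str] = set()      # permanently given up

    def reconcile(self, protect_ids: set[str], db_ids: set[str],
                  queued_ids: set[str]) -> list[str]:
        """Return event IDs that should be (re-)added to the download queue."""
        to_queue = []
        for event_id in protect_ids - db_ids:
            if event_id in self.ignored or event_id in queued_ids:
                continue  # already given up on, or already in flight
            count = self.attempts.get(event_id, 0)
            if count >= self.max_attempts:
                self.ignored.add(event_id)  # exceeded the configured amount
                continue
            self.attempts[event_id] = count + 1
            to_queue.append(event_id)
        # Events now present in the database no longer need tracking.
        for event_id in db_ids:
            self.attempts.pop(event_id, None)
        return to_queue
```

Failure is inferred purely from "in Protect but not in the database", so neither the downloader nor the uploader has to report anything back.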
I will work on implementing this this evening and get your input.
I understand the approach now. That makes sense. Thanks for explaining.
Reduce the time range used for missing events checker to stop checking the entire range every time.
Once the missing events checker starts, the start of the next time range will be the end of the current time range. This can be pulled earlier for reasons such as ongoing events and failed uploads.
I have tested failing uploads by throwing an error and checking to see if the start time has gone earlier for the next missing events checker.
I haven't tested ongoing events as it is hard to time an active event to trigger when the missing events checker is running.