From 786182549e53d1893faaa8ed4dcf210dda00e75b Mon Sep 17 00:00:00 2001 From: KevinMidboe Date: Tue, 3 Oct 2017 11:11:12 +0200 Subject: [PATCH] Added speed tests and plans for better parsing and speed. --- knowledgeBase.md | 96 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 96 insertions(+) diff --git a/knowledgeBase.md b/knowledgeBase.md index f17c6a6..0f95342 100644 --- a/knowledgeBase.md +++ b/knowledgeBase.md @@ -469,3 +469,99 @@ scan_video(path): ``` ## Video + + +## Runtimes +At commit #30 we are walking through the directory with the function shown in core above. A run through The Office US of 201 episodes gives us a total runtime of 17.716. Ideas of what is slowing down the runtime: + + - Walking through the entire directory tree. + - Checking that it is a folder that exists. + - Guessing the episode name, number and info with the guessit library. + - Langdetect of a subtitle file. + +``` +Only scan: real 0m0.745s +Only subs: real 0m4.273s +Only videos: real 0m13.280s +``` + +Clearly something happening in video that takes time. +> Also more video objects that subs + +## Moving away from guessit +I wanted to check how accurate hits we could get with regex. The test is to compare the results from a simple reqex function with the output of guessit. Our code is the following: + +``` +def removeLeadingZero(number): + stringedNumber = str(number) + if (len(stringedNumber) > 1 and stringedNumber[0] == '0'): + return int(stringedNumber[1:]) + return int(number) + +class episode(object): + def __init__(self, path): + self.path = path + self.season = self.getSeasonNumber() + self.episode = self.getEpisodeNumber() + + def getSeasonNumber(self): + m = re.search('[sS][0-9]{1,2}', self.path) + if m: + seasonNumber = re.sub('[sS]', '', m.group(0)) + return removeLeadingZero(seasonNumber) + + def getEpisodeNumber(self): + m = re.search('[eE][0-9]{1,2}', self.path) + if m: + episodeNumber = re.sub('[eE]', '', m.group(0)) + return removeLeadingZero(episodeNumber) +``` + +With this we got: + +``` +seasonedParser:$ time ./scandir.py '/mnt/mainframe/shows/' +Total: 5926, missed was: 33 + +real 2m3.560s +user 1m43.832s +sys 0m0.840s +``` + +Our main misses where episodes with multiple episodes within. Examples follow: + +| Resolved | Filename | Manual guess | Reason for mismatch | +| --- | --- | --- | --- | +|[ ]| The.Office S03E24&25 - The Job [720p].mkv | 3 : 24 | Double episode | +|[ ]| Seinfeld.S07E21E22.The.Bottle.Deposit.720p.WEBrip.AAC.EN-SUB.x264-[MULVAcoded].mkv | 7 : 21 | Double episode | +|[ ]| Friends S10E17 E18.mkv | 10 : 17 | Double episode with spacing | +|[x]| S00E121.The.Seinfeld.Story.mkv | 0 : 12 | Special episode | +|[ ]| Brooklyn.Nine-Nine.S04E11-E12.The.Fugitive.Pt.1-2.1080p.WEB-DL.DD5.1.H264.mkv | 4 : 11 | Double episode | +|[ ]| Greys.Anatomy.S06E01.E02.720p.HDTV.x264.srt | 6 : 1 | Double episode | +|[ ]| Its.Always.Sunny.In.Philadelphia.S04E05E06.DSR.XviD-NoTV.avi | 4 : 5 | Multiple episode | +|[ ]| Chicago.PD.S02E20.Law.and.Order.SVU.S16E20.720p.HDTV.X264-DIMENSION[rarbg].mkv | 2 : 20 | Guessed wrong part | +|[ ]| 03x16 - The Excelsior Acquisition.avi | None | Separated by x | +|[ ]| new.girl.421.hdtv-lol.mp4 | None | No s or ep id chars + + +#### Excepts longer episode number +Except longer episode number, see *S00E121*. + +``` +def getEpisodeNumber(self): + m = re.search('[eE][0-9]{1,3}', self.path) + if m: + episodeNumber = re.sub('[eE]', '', m.group(0)) + return removeLeadingZero(episodeNumber) +``` + +Now we got 4 less misses + +``` +seasonedParser:$ time ./scandir.py '/mnt/mainframe/shows/' +Total: 5926, missed was: 29 + +real 2m0.766s +user 1m41.482s +sys 0m0.851s +``` \ No newline at end of file