KevinMidboe/seasonedParser

KevinMidboe bbb20af93a Added more thoughs around hashing and indexing content.

2017-09-28 13:43:21 +02:00

seasonedGuesser

The following is a description of the optimal flow for discovering mediafiles in a directory.

Detect changes in a directory
Find the files
Analyze the files
What to do with this information
- Different ways we can say a file is interesting
The different ways we want to accept data
Scan vs Convert
Plex Local Media Assets
Alternatives to Run-Options
Monitor End-to-End Movement of Files

Detect changes in a directory

There should be a daemon running to check for changes in the hash for a directory.
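
One way such a check could work is sketched below. It assumes the "hash for a directory" is computed from the names, modification times and sizes of the entries directly inside it; the function name `directory_fingerprint` and that choice of inputs are illustrative assumptions, not something the project has settled on.

```python
import hashlib
import os


def directory_fingerprint(path):
    """Return a SHA-1 digest over the entries directly inside path.

    Assumption: name, mtime and size are enough to notice a change.
    Nested directories are not descended into.
    """
    digest = hashlib.sha1()
    with os.scandir(path) as entries:
        # Sort by name so the digest does not depend on listing order.
        for entry in sorted(entries, key=lambda e: e.name):
            stat = entry.stat()
            line = f"{entry.name}:{stat.st_mtime_ns}:{stat.st_size}\n"
            digest.update(line.encode("utf-8"))
    return digest.hexdigest()
```

A daemon could store the last fingerprint and start a scan only when a freshly computed one differs.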

Find the files

import os
directory = 'somedir'
os.listdir(directory)  # names of the entries directly inside directory

What information do we want:

Name of the files
Hierarchy of the files

The runtime of os.listdir() is the real problem with the system. When collecting from a network drive it is very slow, so we need a faster way to get the directory contents.
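
A possible alternative is `os.scandir`, which returns the file type together with each directory entry and so usually avoids a separate stat call per file. Whether this helps enough on a network drive is an assumption that would have to be measured. The helper name `iter_files` is made up for this sketch; it yields both pieces of information listed above (name and hierarchy).

```python
import os


def iter_files(root):
    """Yield (path relative to root, file name) for every file below root."""
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Descend later; the relative path keeps the hierarchy.
                    stack.append(entry.path)
                elif entry.is_file():
                    yield os.path.relpath(entry.path, root), entry.name
```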

Analyze the files

Step through every file. What information do we want:

What the name of the file is
Can use guessit (speed issue)
We want to know what category of file it is
- Movie
- Episode
- Season folder
- Directory
- Subtitles
- Trash files
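
The categories above could be assigned with something like the following sketch. It deliberately swaps guessit for two plain regular expressions to sidestep the speed issue, so it is far less thorough than guessit. The extension sets, the patterns and the function name `categorize` are all assumptions made for illustration.

```python
import os
import re

# Assumed extension sets; extend as needed.
VIDEO_EXTENSIONS = {'.mkv', '.mp4', '.avi'}
SUBTITLE_EXTENSIONS = {'.srt', '.smi', '.ssa', '.ass'}
EPISODE_PATTERN = re.compile(r'[Ss]\d{1,2}[Ee]\d{1,2}')
SEASON_PATTERN = re.compile(r'\b(?:[Ss]eason[ ._]?\d{1,2}|[Ss]\d{1,2})\b')


def categorize(name, is_dir=False):
    """Return one of: movie, episode, season folder, directory, subtitles, trash."""
    if is_dir:
        if SEASON_PATTERN.search(name) and not EPISODE_PATTERN.search(name):
            return 'season folder'
        return 'directory'
    extension = os.path.splitext(name)[1].lower()
    if extension in SUBTITLE_EXTENSIONS:
        return 'subtitles'
    if extension in VIDEO_EXTENSIONS:
        if 'sample' in name.lower():
            return 'trash'
        # A video without an SxxEyy marker is assumed to be a movie.
        return 'episode' if EPISODE_PATTERN.search(name) else 'movie'
    return 'trash'
```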

What to do with this information

When we know it is a movie, subtitles or an episode, create an object for that type.
If it is a directory, subtitles

Different ways we can say a file is interesting

If it is in a directory that has enough information
If it has enough information in itself to know what it is and where it belongs.

We need to check this because of directories like the following:

Twin Peaks Season 1 1080p WEB-DL DD5.1/
├── Twin Peaks S01E01
│   ├── Twin Peaks S01E01 Pilot.en.srt
│   └── Twin Peaks S01E01 Pilot.mkv
├── Twin Peaks S01E02
│   ├── Twin Peaks S01E02 Traces to Nowhere.en.srt
│   └── Twin Peaks S01E02 Traces to Nowhere.mkv

What do we do here?

Do we disregard the folder name, look at the files, and say that we have one mediafile [.mkv] and one subtitles file [.srt] and therefore have everything we need for an element?
Or do we look at the folder name, assume that the files inside are most likely correct and, without looking at the information within the name of each element, just rename the files based on the name of the parent folder?

If we save the parent directory file name and the information within, then if the file does not have enough information we can check with the parent folder if we can extract more information about the item.

Sidenote: this can be detrimental if the file we are looking at is a sample or some other trash. Then we can accidentally select a file that is useless to us.
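
The fallback to the parent folder could look roughly like this. The sketch only extracts season and episode numbers from an SxxEyy marker and treats any file with "sample" in its name as trash; both rules, and the function name `season_and_episode`, are assumptions for illustration.

```python
import os
import re

EPISODE_PATTERN = re.compile(r'[Ss](\d{1,2})[Ee](\d{1,2})')


def season_and_episode(path):
    """Return (season, episode) from the file name, else from its parent folder.

    Returns None for assumed trash (names containing 'sample') and when
    neither the file nor the parent folder carries an SxxEyy marker.
    """
    parent, name = os.path.split(path)
    if 'sample' in name.lower():
        return None
    # Prefer the file's own name; consult the parent folder only if needed.
    for candidate in (name, os.path.basename(parent)):
        match = EPISODE_PATTERN.search(candidate)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None
```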

The different ways we want to accept data

Movies

A movie that is within a folder of the same name:

Interstellar.2014.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1/
├── Interstellar.2014.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1.eng.srt
└── Interstellar.2014.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1.mkv

A movie that is standalone:

Interstellar.2014.1080p.BluRay.REMUX.AVC.DTS-HD.MA.5.1.mkv

A movie with extras:

Swiss.Army.Man.2016.Bluray.1080p.TrueHD-7.1.Atmos.x264-Grym/
├── Swiss.Army.Man.2016.Bluray.1080p.TrueHD-7.1.Atmos.x264-Grym.mkv
├── Swiss.Army.Man.Extras-Grym
│   ├── Behind.the.Scenes-Grym.mkv
│   ├── Deleted.Scenes-Grym.mkv
│   ├── Making.Manny-Grym.mkv
│   └── Q.and.A.Session.with.the.Filmmakers-Grym.mkv
├── Torrent downloaded from demonoid.ph.txt
└── Torrent downloaded from......txt

Show w/ complete season folder

A show's complete season folder with a separate folder per episode:

Community.720p.1080p.WEB-DL.DD5.1.H.264/S03/
├── Community S03E01
│   └── Community S03E01 Biology 101.mkv
├── Community S03E02
│   ├── Community S03E02 Geography of Global Conflict.en.srt
│   └── Community S03E02 Geography of Global Conflict.mkv
├── Community S03E03
│   ├── Community S03E03 Competitive Ecology.en.srt
│   └── Community S03E03 Competitive Ecology.mkv

A show's complete season folder without separate folders:

Penn and Teller Fool Us S01 WEB-DL x264-FUM[ettv]
├── Penn.and.Teller.Fool.Us.S01E01.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E02.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E03.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E04.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E05.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E06.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E07.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01E08.WEB-DL.x264-FUM.mp4
├── Penn.and.Teller.Fool.Us.S01.Special.WEB-DL.x264-FUM.mp4
└── Torrent-Downloaded-From-extratorrent.cc.txt

Show with single episode

A show's episode in a separate folder:

Twin.Peaks.S03E17.1080p.WEB.H264-STRiFE[rarbg]/
├── RARBG.txt
├── twin.peaks.s03e17.1080p.web.h264-strife.mkv
└── twin.peaks.s03e17.1080p.web.h264-strife.nfo

Plex Local Media Assets

Enable "Local Media Assets"

Because I use Plex, and it is to date the leading platform for hosting multimedia libraries, we have decided to follow its naming scheme.
"Local Media Assets" is an Agent source that loads local media files or embedded metadata for a media item. To do this, ensure the Agent source is enabled and topmost in the list:

Launch the Plex Web App
Choose Settings from the top right of the Home screen
Select your Plex Media Server from the horizontal list
Choose Agents
Choose the Library Type and Agent you want to change
Ensure Local Media Assets is checked
Ensure Local Media Assets is topmost in the list

Extra Subtitle Files

Several formats of subtitle files are supported and can be picked up by the Local Media Assets scanner:

.srt
.smi
.ssa (or .ass)

Other formats such as VOBSUB, PGS, etc. may work on some Plex Apps but not all. If you use the Universal Transcoder, both VOBSUBS and PGS subtitles will be "burned in" during the transcoding process and shown.

Subtitle files need to be named as follows:

MovieName (Release Date).[Language_Code].ext
Movies/MovieName (Release Date).[Language_Code].ext
Movies/MovieName (Release Date).[Language_Code].forced.ext
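
As a minimal sketch, a target name following this scheme could be built as below. It assumes that "Release Date" means the release year and that the language code is passed in ready to use; `subtitle_filename` is a hypothetical helper name.

```python
def subtitle_filename(movie_name, year, language_code, extension, forced=False):
    """Build 'MovieName (Year).lang[.forced].ext' for a subtitle file."""
    parts = [f"{movie_name} ({year})", language_code]
    if forced:
        parts.append('forced')
    # Accept the extension with or without a leading dot.
    return '.'.join(parts) + '.' + extension.lstrip('.')
```

For example, `subtitle_filename('Interstellar', 2014, 'en', 'srt')` gives `Interstellar (2014).en.srt`.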

Local Trailers and Extras

If you have trailers, interviews, behind the scenes videos, or other "extras" type content for your movies, you can add those.

Organized in Subdirectories

You can organize your local extras into specific subdirectories inside the main directory named for the movie. Extras will be detected and used if named and stored as follows:

Movies/MovieName (Release Date)/Extra_Directory_Type/Descriptive_name.ext

Where Extra_Directory_Type is one of:

Behind The Scenes
Deleted Scenes
Featurettes
Interviews
Scenes
Shorts
Trailers

It is recommended that you provide some sort of descriptive name for the extras filenames.

Swiss Army Man (2016)/
├── Behind The Scenes
│   └── Behind the Scenes (Local).mkv
├── Deleted Scenes
│   └── Deleted Scenes (Local).mkv
├── Featurettes
│   ├── Making Of (Local).mkv
│   └── Q and A Session with the Filmmakers (Local).mkv
└── Swiss.Army.Man.2016.Bluray.1080p.TrueHD-7.1.Atmos.x264.mkv
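
A destination path for an extra could be built along these lines. This is only a sketch: `extra_path` is a hypothetical helper, the year stands in for "Release Date", and deciding which directory type a given release file belongs to is left to the caller.

```python
import os

# The directory types listed above.
EXTRA_DIRECTORY_TYPES = {
    'Behind The Scenes', 'Deleted Scenes', 'Featurettes',
    'Interviews', 'Scenes', 'Shorts', 'Trailers',
}


def extra_path(movie_name, year, extra_type, descriptive_name, extension):
    """Return 'MovieName (Year)/Extra_Directory_Type/Descriptive_name.ext'."""
    if extra_type not in EXTRA_DIRECTORY_TYPES:
        raise ValueError(f"unknown extra directory type: {extra_type!r}")
    filename = f"{descriptive_name}.{extension.lstrip('.')}"
    return os.path.join(f"{movie_name} ({year})", extra_type, filename)
```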

How to group together items

Hashes are our friend! We want to take the minimal amount of separately identifying information and hash it to an index value. This will in effect become a hash table.

Movies: Movie name and release year.
Shows: Series name, season number and episode number.

user@hostname:/$ echo 'interstellar.2017' | sha1sum
4ecc56e9bb3d0ef4b0b48cbe14f78974ea24ab35

user@hostname:/$ echo 'new girl.2.17' | sha1sum
bb1c1339fa4211f65013f3ce36004253cc89fe04

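Separate items with '.'.
That is; '.'.join([series_name, str(season), str(episode)])
Note that echo appends a trailing newline, which becomes part of the hashed input. The sha1sum output therefore differs from hashing the same string without a newline, for example with Python's hashlib. Use printf '%s' instead of echo if the two must match.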

>>> series_name='new girl'
>>> season=2
>>> episode=17
>>> '.'.join([series_name, str(season), str(episode)])

'new girl.2.17'

NB: Episode and season number should NOT have a leading 0 here!

Hashing episode in python

import hashlib

show = 'Rick and morty'.lower()
season = 3
for ep in range(1,10):  # episodes 1 through 9
	itemConcat = '.'.join([show, str(season), str(ep)])
	hash_object = hashlib.sha1(itemConcat.encode())
	hex_dig = hash_object.hexdigest()
	print('%s : %s' % (hex_dig, itemConcat))
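
The same idea can be wrapped in one helper that serves both movies and episodes and enforces the no-leading-zero rule. This is a sketch under assumptions: `item_hash` is a made-up name, lowercasing is assumed to be the only text normalization wanted, and a purely numeric title with leading zeros would also lose them.

```python
import hashlib


def item_hash(*parts):
    """Join parts with '.' and return (key, SHA-1 hex digest of the key)."""
    normalized = []
    for part in parts:
        text = str(part).strip().lower()
        if text.isdigit():
            # int() drops leading zeros, so '02' and 2 give the same key.
            text = str(int(text))
        normalized.append(text)
    key = '.'.join(normalized)
    return key, hashlib.sha1(key.encode('utf-8')).hexdigest()
```

A movie would be keyed with `item_hash(name, year)` and an episode with `item_hash(series_name, season, episode)`.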

What information should a hash index contain?

For a show item, we would hash the name of the show, season number and episode number. What information do we want to keep about an item?

Show episode

The full name of the file.
What show
Season
Episode
[Name of episode]

Movie

The full name of the file
Movie name
Year
Extra?

Subtitles

The full name of the file
Language
SDH?

To move the item we just need the hash, and append all the other information.

What we also can do with the hash/information problem

OK, so the problem is that we really just want one class per folder. Having a separate subtitles class goes against this. My reasoning for having one class per folder type is that everything within a hash index can then have the same structure. Having everything in a single class means that we only need one uniform pass over our tree to execute all the operations needed (move and rename).

Show

.
├── 1c133
└── f3ce3
    ├── this.name: New Girl
    ├── this.season: 2
    ├── this.episode: 17
    └── this.objects
        ├── episode:
        │   ├── full name of path
        │   └── [name of the episode]
        └── subtitle:
            ├── full name of path
            ├── language
            └── SDH?
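
The tree above might translate into data classes like the following. All class, field and function names here are assumptions chosen to mirror the tree, and the index is a plain dict keyed by the item hash.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Episode:
    path: str                      # full name of path
    name: Optional[str] = None     # [name of the episode]


@dataclass
class Subtitle:
    path: str                      # full name of path
    language: Optional[str] = None
    sdh: bool = False              # SDH?


@dataclass
class ShowItem:
    name: str
    season: int
    episode: int
    objects: dict = field(
        default_factory=lambda: {'episode': [], 'subtitle': []})


def add_to_index(index, key, name, season, episode):
    """Return the item stored under key, creating it on first use."""
    return index.setdefault(key, ShowItem(name, season, episode))
```

Because every entry has the same shape, one pass over the index values is enough to move and rename both episodes and subtitles.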

Scan vs Convert

We are thinking there should be two main blobs. There should be one for the run cycle, when the new information is found, and one for the elements that have been handled.

Wait! This would mean that we need to move the information.

Looking up names for episodes

The tvdb_api library can fetch extended information about an episode from its season and episode number.

import tvdb_api
t = tvdb_api.Tvdb()
episode = t['Rick and morty'][3][4] # get season 3, episode 4 of show
print(episode['episodename']) # Print episode name

A large stall in the system would be making an HTTP call to the tvdb api for the episode name every time we run seasoned. Instead, when we find an episode from a series, we can look up and cache the names of all episodes in that season of the series.

This can be saved in the blob in the hash location for the episode. This means we can make a hash table insertion without having the episode yet.
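
The per-season caching could be sketched as follows. To keep the example independent of the network, the remote lookup is passed in as `fetch_season`, a hypothetical callable that returns a mapping of episode number to episode name for one season. In practice it might wrap tvdb_api, but that wiring is not shown here.

```python
def make_episode_name_lookup(fetch_season):
    """Wrap fetch_season(show, season) -> {episode: name} with a cache."""
    cache = {}

    def episode_name(show, season, episode):
        key = (show.lower(), season)
        if key not in cache:
            # One remote call per (show, season), however many episodes follow.
            cache[key] = fetch_season(show, season)
        return cache[key].get(episode)

    return episode_name
```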

Alternatives to Run-Options

seasoned parse : Looks through the saved directory and looks for mediafiles to match

--dry : should not commit any of the changes, just print them out.
--type : options movie | show for looking for a specific type of content.
Something to do with subtitles.
Something to do with looking up the name of the episode on tmdb.

seasoned discover :
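
The options so far could be wired up with argparse roughly like this. It is a sketch only: the subtitle and episode-name options are left out because they are not defined yet, and `discover` is registered without any arguments because its behaviour is still unspecified.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='seasoned')
    subcommands = parser.add_subparsers(dest='command', required=True)

    parse = subcommands.add_parser(
        'parse', help='look through the saved directory for mediafiles to match')
    parse.add_argument(
        '--dry', action='store_true',
        help='do not commit any of the changes, just print them out')
    parse.add_argument(
        '--type', choices=['movie', 'show'],
        help='only look for a specific type of content')

    # Placeholder: what discover should do has not been written down yet.
    subcommands.add_parser('discover')
    return parser
```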

Monitor End-to-End Movement of Files

Using watchdog can give us verification that something we wanted to happen actually has happened.

watchdog.events.DirMovedEvent

If this is to be used, we need a way to check the output of the watchdog event handler very quickly after we have done an action. Such actions may be, but are not limited to, renaming a file, creating a directory or moving files from an old directory to a new one.
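
One possible shape for such a check is a handler that records move events in a queue, which the code that performed the move then polls with a short timeout. The class name `MoveVerifier` and its `wait_for_move` method are assumptions for this sketch. The import is guarded so the example also runs where watchdog is not installed.

```python
import queue
import time

try:
    from watchdog.events import FileSystemEventHandler
except ImportError:
    # Fallback so the sketch runs without watchdog installed.
    FileSystemEventHandler = object


class MoveVerifier(FileSystemEventHandler):
    """Collect moved events so a caller can confirm a move took place."""

    def __init__(self):
        super().__init__()
        self.events = queue.Queue()

    def on_moved(self, event):
        # watchdog calls this for both file and directory moves.
        self.events.put((event.src_path, event.dest_path))

    def wait_for_move(self, src, dest, timeout=1.0):
        """Return True if a move from src to dest is seen within timeout."""
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            try:
                moved = self.events.get(timeout=remaining)
            except queue.Empty:
                return False
            if moved == (src, dest):
                return True
```

With watchdog installed, the handler would be registered on an observer with `Observer().schedule(handler, path, recursive=True)` before the move is performed.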

Event queues and emitters

The event queue may be what we should inspect to verify that an event has happened.

class watchdog.observers.api.EventQueue(maxsize=0)

Thread-safe event queue based on a special queue that skips adding the same event (FileSystemEvent) multiple times consecutively. Thus avoiding dispatching multiple event handling calls when multiple identical events are produced quicker than an observer can consume them.

watchdog.observers.api.EventEmitter(event_queue, watch, timeout=1)

Producer thread base class subclassed by event emitters that generate events and populate a queue with them.
