AI Blitz #6

Work in progress about video transcriptions

Capture images from video and compare them to identify moves


What to do with videos?

  • Motivation

After playing with the first 4 image puzzles (see my first Notebook here, with submissions at around 99% accuracy), it's time to face the last (but not least) puzzle: video transcription.

As this is a new field for me, I started with some web research, and here is where I am. One direction I found is to capture several images from each video and analyse these images.

--> Could this bring us back to an image model?

--> Could I use the FEN Notation transcription model to compare pictures?

  • Context

We have access to short videos (around 1 second) of a chessboard with some moving pieces (around 4-8 moves). The objective is to transcribe the moves from each video.

In [63]:
## - connect
!pip install --upgrade fastai git+https://gitlab.aicrowd.com/yoogottamk/aicrowd-cli.git >/dev/null
%load_ext aicrowd.magic

%aicrowd login
  Running command git clone -q https://gitlab.aicrowd.com/yoogottamk/aicrowd-cli.git /tmp/pip-req-build-u393pd18
The aicrowd.magic extension is already loaded. To reload it, use:
  %reload_ext aicrowd.magic
Verifying API Key...
API Key valid
Saved API Key successfully!

Import Packages 📦

In [64]:
## - libraries
import cv2     # for capturing videos
import math   # for mathematical operations
import matplotlib.pyplot as plt    # for plotting the images
import matplotlib.image as mpimg
%matplotlib inline
import pandas as pd
from keras.preprocessing import image   # for preprocessing the images
import numpy as np    # for mathematical operations
from keras.utils import np_utils
from skimage.transform import resize   # for resizing images
from sklearn.model_selection import train_test_split
from glob import glob
from tqdm import tqdm
import string

# for model building
import keras
from keras.models import Sequential
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, InputLayer, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from skimage import io, transform
from skimage.util.shape import view_as_blocks

Access Data ♚♕♜♘♝♙

In [ ]:
## - data
%aicrowd dataset download --challenge chess-transcription -j 3

!mkdir data
!mkdir data/video
!unzip train.zip  -d data/video/ 
!unzip val.zip -d data/video/ 
!unzip test.zip  -d data/video/ 

!mv train.zip data/video/train.zip
!mv train.csv data/video/train.csv
!mv val.csv data/video/val.csv
!mv val.zip data/video/val.zip
!mv test.zip data/video/test.zip
!mv sample_submission.csv data/video/sample_submission.csv
In [65]:
video_df = pd.read_csv("data/video/train.csv")
video_df['VideoID'] = video_df['VideoID'].astype(str)+".mp4"
VideoID label
0 0.mp4 e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
1 1.mp4 c2b3 e1f1 f7e7 a3a4 e7e3 a4a5
2 2.mp4 f1c1 h6h7 h4f3 h2g4 f6g6 g4f6 c1g1 f6d7 g6f5
3 3.mp4 h4h5 a7b5 c4b2 d7e6 g1h3 h6f7
4 4.mp4 c6g6 b1c2 g6g5 a2b2 h4f5 e4d3
... ... ...
4995 4995.mp4 f2e1 b5b8 d1d2 c5c4 d2e2 b8b4
4996 4996.mp4 c7e6 h8b2 d5d4 c4c3 d4e4 c3d2 c8a7 h4g2 b6c6
4997 4997.mp4 g4e5 c5b4 f5b1 b4a5 b6a5 e3e4 g5h4 c4e5 h4g3
4998 4998.mp4 h7h5 h1e1 h8a8 e3d2 e7f7 a4a6 a8f8 e5f7
4999 4999.mp4 c1a1 g5h4 g2h2 d5c4 d2f3 b5b4 e3f4 d8e8

5000 rows × 2 columns

In [ ]:
video_dftest = pd.read_csv("data/video/sample_submission.csv")
video_dftest['VideoID'] = video_dftest['VideoID'].astype(str)+".mp4"
video_dftest['label'] = ''

From video to images 🎥📸

The aim is to transform each video into several images, as it should be easier to analyse these images than the video itself. Most of the time, a movie contains 24 images per second of video. That could be too much for this problem; maybe only 5 images would be enough.

As a starting point, we will try to decompose each video into 2 pictures (stored in a new folder), one at the beginning of the video and a second at the end. The same label, the one of the original video, will be attached to each associated image.
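To make the "2 pictures" or "5 pictures" idea concrete, here is a minimal sketch that picks evenly spaced frame positions between the first and the last frame. The helper name `sample_frame_indices` is my own invention for illustration and is not part of the notebook.

```python
def sample_frame_indices(total_frames, n_samples):
    """Return up to n_samples frame indices spread evenly from first to last frame."""
    if total_frames <= 0 or n_samples <= 0:
        return []
    if n_samples == 1 or total_frames == 1:
        return [0]
    # never ask for more samples than there are frames
    n_samples = min(n_samples, total_frames)
    step = (total_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


print(sample_frame_indices(24, 2))  # [0, 23]: first and last frame
print(sample_frame_indices(24, 5))  # five positions from frame 0 to frame 23
```

With OpenCV, the total could probably be read with `cap.get(cv2.CAP_PROP_FRAME_COUNT)` and each position reached with `cap.set(cv2.CAP_PROP_POS_FRAMES, idx)` before `cap.read()`, keeping in mind that the reported frame count is not always exact for every codec.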

In [ ]:
!rm -rf data/video/train_frame
!mkdir data/video/train_frame
In [ ]:
for i in tqdm(range(8)):
#for i in tqdm(range(video_df.shape[0])):
  count = 0
  videoFile = video_df['VideoID'][i]
  cap = cv2.VideoCapture('data/video/train/' + videoFile)  # capturing the video from the given path
  frameRate = 2 #cap.get(5)                                   # keep one frame out of every `frameRate`
  while(cap.isOpened()):
      frameId = cap.get(1)                                 # current frame number
      ret, frame = cap.read()
      if (ret != True):
          break                                            # no more frames to read
      if (frameId % math.floor(frameRate) == 0):
          filename ='data/video/train_frame/' + videoFile + "_frame%d.jpg" % count;count+=1
          cv2.imwrite(filename, frame)                      # storing the frames
  cap.release()                                            # free the video file

print ("Done!")

Let's have a look at the images extracted from the first video: we now have the start and end views of the video. Could this be enough to understand the moves? I think so.

In [ ]:
f, axarr = plt.subplots(1,4, figsize=(15, 15))

for i in range(0,4):
  if(i == 1):
    axarr[i].set_title(video_df['label'][0], fontsize=12, pad=5)
  axarr[i].imshow(mpimg.imread('data/video/train_frame/0.mp4_frame' + str(i) + '.jpg'))
In [ ]:
f, axarr = plt.subplots(1,10, figsize=(20, 20))

for i in range(0,10):
  if(i == 2):
    axarr[i].set_title(video_df['label'][7], fontsize=12, pad=5)
  axarr[i].imshow(mpimg.imread('data/video/train_frame/7.mp4_frame' + str(i) + '.jpg'))

In order to associate the right label with the corresponding images, we create a new train dataset.
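The frame files are named like `0.mp4_frame3.jpg`, so the video name and the frame number can be recovered from the file name alone. As a small sketch (the helper `parse_frame_path` and the regular expression are my own assumptions, not code from this notebook), the parsing could look like this:

```python
import os
import re

# '<video name>_frame<number>.jpg', e.g. '0.mp4_frame3.jpg'
FRAME_PATTERN = re.compile(r"^(?P<video>.+)_frame(?P<frame>\d+)\.jpg$")


def parse_frame_path(path):
    """Return (video name, frame number, image file name) for a frame path."""
    image_name = os.path.basename(path)
    match = FRAME_PATTERN.match(image_name)
    if match is None:
        raise ValueError(f"unexpected frame file name: {image_name}")
    return match.group("video"), int(match.group("frame")), image_name


print(parse_frame_path("data/video/train_frame/0.mp4_frame3.jpg"))
```

Using the base name instead of a fixed position in the split path keeps the parsing valid if the folder depth changes.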

In [ ]:
# - getting the names of all the images
images = glob("data/video/train_frame/*.jpg")
video_name = []
image_name = []
frame_number = []
video_label = []

for i in tqdm(range(len(images))):
    # - creating the image name
    imageName = images[i].split('/')[3]
    # - creating the image label
    videoName = imageName.split('_')[0]
    frameNb = imageName.split('_')[1].split('.')[0][5:]
    frameNb = int(frameNb)
    videoLabel = video_df[video_df['VideoID'] == videoName]['label'].iloc[0]
    # - storing the values for this image
    video_name.append(videoName)
    image_name.append(imageName)
    frame_number.append(frameNb)
    video_label.append(videoLabel)

# - storing the images and their class in a dataframe
image_df = pd.DataFrame()
image_df['VideoID'] = video_name
image_df['frame'] = frame_number
image_df['ImageID'] = image_name
image_df['label'] = video_label

# - converting the dataframe into csv file 
image_df.to_csv('data/video/image_df.csv', header = True, index = False)

image_df = image_df.sort_values(['VideoID', 'frame'])
image_df.index = range(image_df.shape[0])
100%|██████████| 91/91 [00:00<00:00, 967.63it/s]
Out[ ]:
VideoID frame ImageID label
0 0.mp4 0 0.mp4_frame0.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
1 0.mp4 1 0.mp4_frame1.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
2 0.mp4 2 0.mp4_frame2.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
3 0.mp4 3 0.mp4_frame3.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
4 0.mp4 4 0.mp4_frame4.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
... ... ... ... ...
86 7.mp4 5 7.mp4_frame5.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
87 7.mp4 6 7.mp4_frame6.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
88 7.mp4 7 7.mp4_frame7.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
89 7.mp4 8 7.mp4_frame8.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
90 7.mp4 9 7.mp4_frame9.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6

91 rows × 4 columns

Training 💪

This solution has not been explored yet.

In [ ]:
## creating an empty list
#train_images = []

## for loop to read and store frames
#for i in tqdm(range(image_df.shape[0])):
#    # loading the image and keeping the target size as (224,224,3)
#    img = image.load_img('data/video/train_frame/' + image_df['ImageID'][i], target_size=(224,224,3))
#    # converting it to array
#    img = image.img_to_array(img)
#    # normalizing the pixel value
#    img = img/255
#    # appending the image to the train_image list
#    train_images.append(img)
#train_array = np.array(train_images)

FEN Notation 📝

I already developed a model (3rd puzzle of this challenge) to translate a chessboard image into its FEN Notation. One solution could be to compare the FEN Notations of 2 pictures from the same video, and build a function to identify the moves required to go from one notation to the other.
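To illustrate the idea on the notations themselves, here is a rough sketch that expands two FEN piece-placement strings into 8x8 grids and reads a single move from the squares that changed. The names `fen_to_grid` and `detect_move` are hypothetical, and the sketch assumes exactly one simple move (or capture) between the two positions; anything else returns `None`.

```python
def fen_to_grid(fen):
    """Expand the piece-placement field of a FEN string into an 8x8 grid.

    Row 0 is rank 8 and empty squares are None.
    """
    grid = []
    for rank in fen.split()[0].split("/"):
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend([None] * int(ch))  # a digit is a run of empty squares
            else:
                row.append(ch)
        if len(row) != 8:
            raise ValueError(f"bad rank in FEN: {rank}")
        grid.append(row)
    if len(grid) != 8:
        raise ValueError("a FEN placement needs 8 ranks")
    return grid


def square_name(row, col):
    return "abcdefgh"[col] + str(8 - row)


def detect_move(fen_before, fen_after):
    """Return a move like 'e2e4', '' if nothing changed, None if ambiguous."""
    before, after = fen_to_grid(fen_before), fen_to_grid(fen_after)
    changed = [(r, c) for r in range(8) for c in range(8)
               if before[r][c] != after[r][c]]
    if not changed:
        return ""
    if len(changed) != 2:
        return None  # more than one move, castling, or a misread square
    # the source square is the changed square that is empty afterwards
    sources = [sq for sq in changed if after[sq[0]][sq[1]] is None]
    if len(sources) != 1:
        return None
    src = sources[0]
    dst = changed[0] if changed[1] == src else changed[1]
    if after[dst[0]][dst[1]] != before[src[0]][src[1]]:
        return None  # the piece changed type: promotion or a misread square
    return square_name(*src) + square_name(*dst)


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
after_e4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR"
print(detect_move(start, after_e4))  # e2e4
```

Returning `None` for ambiguous cases makes it easy to spot frame pairs where two moves happened between captures, which would be a hint that more frames are needed.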

In [67]:
# load json and create model
with open('FEN_model.json', 'r') as json_file:
    FEN_model_json = json_file.read()
FEN_model = keras.models.model_from_json(FEN_model_json)
# load weights into new model
# (the FEN_model.load_weights(...) call and its weights file are not shown here)
print("Loaded FEN model from disk")
Loaded FEN model from disk

Translate FEN Notation from new images

In [68]:
def pred_gen(features, batch_size):
    for i, img in enumerate(features):
        yield process_image(img)

def process_image(img):
    downsample_size = 200
    square_size = int(downsample_size/8)
    img_read = io.imread(img)
    img_read = transform.resize(
      img_read, (downsample_size, downsample_size), mode='constant')
    tiles = view_as_blocks(img_read, block_shape=(square_size, square_size, 3))
    tiles = tiles.squeeze(axis=2)
    return tiles.reshape(64, square_size, square_size, 3)

By applying the FEN model, we create an array of 64 values (one for each square of the chessboard) for each image. We already developed a function to translate this array into a FEN Notation. But for the purpose of this challenge (transcribing moves), we need to develop a new function.

--> How can we build a function that compares 2 successive arrays, detects the differences and associates them with a move?

In [69]:
path1 = 'data/video/train_frame/' + image_df['ImageID'][0]
oh1 = FEN_model.predict(process_image(path1)).argmax(axis=1).reshape(-1, 8, 8)[0]
array([[12,  3, 12, 12, 12, 12,  1, 12],
       [ 2, 12, 12, 12, 12, 12, 12, 12],
       [12, 12, 12, 12,  0,  6,  4,  0],
       [ 1, 12, 12, 12, 12, 12, 12, 12],
       [12, 12,  6, 12, 12, 12, 12,  0],
       [ 3, 12, 12, 12, 12, 12, 12,  6],
       [ 6, 12,  7, 12, 10, 12,  6, 12],
       [12, 12, 12, 12, 12, 12, 12,  7]])
In [70]:
path2 = 'data/video/train_frame/' + image_df['ImageID'][1]
oh2 = FEN_model.predict(process_image(path2)).argmax(axis=1).reshape(-1, 8, 8)[0]
array([[12,  3, 12, 12, 12, 12,  1, 12],
       [ 2, 12, 12, 12, 12, 12, 12, 12],
       [12, 12, 12, 12,  0,  6,  4,  0],
       [ 1, 12, 12, 12, 12, 12, 12, 12],
       [12, 12,  6, 12, 12, 12, 12,  0],
       [ 3, 12, 12, 10, 12, 12, 12,  6],
       [ 6, 12,  7, 12, 12, 12,  6, 12],
       [12, 12, 12, 12, 12, 12, 12,  7]])
In [71]:
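def move_from_2onehots(oh1, oh2):
    # collect the squares whose predicted class differs between the two boards
    case = ''
    for j in range(8):
        for i in range(8):
            if(oh1[j][i] != oh2[j][i]):
                case += 'abcdefgh'[i] + str(8-j)
    if(case == ''):
      #print('no move')
      return case

    # class 12 is an empty square: if the first changed square is occupied in oh2,
    # it is the destination, so the two squares are swapped
    if(oh2[8-int(case[1])][string.ascii_lowercase.index(case[0])] != 12):
      output = case[2:] + case[:2]
    else:
      output = case

    return output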

move_from_2onehots(oh1, oh2)
In [ ]:
train_moves = []

for i in tqdm(range(image_df.shape[0]-1)):

    if(image_df['ImageID'][i].split('_')[0] == image_df['ImageID'][i+1].split('_')[0]):
        path1 = 'data/video/train_frame/'+ image_df['ImageID'][i]
        oh1 = FEN_model.predict(process_image(path1)).argmax(axis=1).reshape(-1, 8, 8)[0]

        path2 = 'data/video/train_frame/'+ image_df['ImageID'][i+1]
        oh2 = FEN_model.predict(process_image(path2)).argmax(axis=1).reshape(-1, 8, 8)[0]

        detected_move = move_from_2onehots(oh1, oh2)
        train_moves.append(detected_move)
    else:
        train_moves.append('')   # last frame of a video: no next frame to compare with

train_moves.append('')           # the very last frame has no successor either

In [ ]:
image_df['moves'] = train_moves
Out[ ]:
VideoID frame ImageID label moves
0 0.mp4 0 0.mp4_frame0.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2 e2d3
1 0.mp4 1 0.mp4_frame1.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
2 0.mp4 2 0.mp4_frame2.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2 b8d7
3 0.mp4 3 0.mp4_frame3.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
4 0.mp4 4 0.mp4_frame4.jpg e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2 d3d2
... ... ... ... ... ...
86 7.mp4 5 7.mp4_frame5.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
87 7.mp4 6 7.mp4_frame6.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6 f4e3
88 7.mp4 7 7.mp4_frame7.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6 f7f5
89 7.mp4 8 7.mp4_frame8.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6
90 7.mp4 9 7.mp4_frame9.jpg a8d8 d1c1 f8g7 f4e3 f7f5 b5c6

91 rows × 5 columns

In [ ]:
video_moves = []

output = image_df['moves'][0]
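for i in tqdm(range(image_df.shape[0]-1)):
    if(image_df['VideoID'][i+1] == image_df['VideoID'][i]):
      if(image_df['moves'][i+1] != ''):
        output += ' '
        output += image_df['moves'][i+1]
    else:
      video_moves.append(output)            # a new video starts: store the finished one
      output = image_df['moves'][i+1]
video_moves.append(output)                  # store the last video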

100%|██████████| 90/90 [00:00<00:00, 25187.65it/s]

Out[ ]:
['e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2',
 ' c2b3 e1f1 f7e7 a3a4 e7e3 a4a5',
 ' f1c1 h6h7 h4f3 h2g4 f6g6 g4f6 c1g1 f6d7 g6f5',
 ' h4h5 a7b5 c4b2 d7e6 g1h3',
 ' c6g6 b1c2 g6g5 a2b2 h4f5 e4d3',
 ' f1e1 a8e8 e1f1 c5d3 b1b5 f8e7',
 'a3a1 f8f7 a1f1 e7c7 f2e3 e5f6 f1c1 f6c3',
 'a8d8 d1c1 f8g7 f4e3 f7f5']
In [ ]:
video_df.head(n = 8)
Out[ ]:
VideoID label
0 0.mp4 e2d3 b8d7 d3d2 g6h7 c4c5 a5a4 d2c1 a7c5 c2f2
1 1.mp4 c2b3 e1f1 f7e7 a3a4 e7e3 a4a5
2 2.mp4 f1c1 h6h7 h4f3 h2g4 f6g6 g4f6 c1g1 f6d7 g6f5
3 3.mp4 h4h5 a7b5 c4b2 d7e6 g1h3 h6f7
4 4.mp4 c6g6 b1c2 g6g5 a2b2 h4f5 e4d3
5 5.mp4 f1e1 a8e8 e1f1 c5d3 b1b5 f8e7 f1c1
6 6.mp4 a3a1 f8f7 a1f1 e7c7 f2e3 e5f6 f1c1 f6c3 b3c3
7 7.mp4 a8d8 d1c1 f8g7 f4e3 f7f5 b5c6

It seems to work, but not perfectly, as the last moves are sometimes missing! This comes from the images: for a reason I don't understand yet, the last view of the video is sometimes not captured. Maybe I will have to increase the number of frames. Let's try to make a submission with this.

Apply to test dataset

In [72]:
!rm -rf data/video/test_frame
!mkdir data/video/test_frame
In [ ]:
for i in tqdm(range(video_dftest.shape[0])):
  count = 0
  videoFile = video_dftest['VideoID'][i]
  cap = cv2.VideoCapture('data/video/test/' + videoFile)  # capturing the video from the given path
  frameRate = 1                                           # keep every frame
  while(cap.isOpened()):
      frameId = cap.get(1)                                 # current frame number
      ret, frame = cap.read()
      if (ret != True):
          break                                            # no more frames to read
      if (frameId % math.floor(frameRate) == 0):
          filename ='data/video/test_frame/' + videoFile + "_frame%d.jpg" % count;count+=1
          cv2.imwrite(filename, frame)                      # storing the frames
  cap.release()                                            # free the video file

print ("Done!")
In [74]:
# - getting the names of all the images
images_test = glob("data/video/test_frame/*.jpg")
video_name = []
image_name = []
frame_number = []

for i in tqdm(range(len(images_test))):
    # - creating the image name
    imageName = images_test[i].split('/')[3]
    # - creating the image label
    videoName = imageName.split('_')[0]
    frameNb = imageName.split('_')[1].split('.')[0][5:]
    frameNb = int(frameNb)
    # - storing the values for this image
    video_name.append(videoName)
    image_name.append(imageName)
    frame_number.append(frameNb)

# - storing the images and their class in a dataframe
image_dftest = pd.DataFrame()
image_dftest['VideoID'] = video_name
image_dftest['frame'] = frame_number
image_dftest['ImageID'] = image_name

# - converting the dataframe into csv file 
image_dftest.to_csv('data/video/image_dftest.csv', header = True, index = False)

image_dftest = image_dftest.sort_values(['VideoID', 'frame'])
image_dftest.index = range(image_dftest.shape[0])
100%|██████████| 43910/43910 [00:00<00:00, 335251.75it/s]
VideoID frame ImageID
0 0.mp4 0 0.mp4_frame0.jpg
1 0.mp4 1 0.mp4_frame1.jpg
2 0.mp4 2 0.mp4_frame2.jpg
3 0.mp4 3 0.mp4_frame3.jpg
4 0.mp4 4 0.mp4_frame4.jpg
... ... ... ...
43905 999.mp4 23 999.mp4_frame23.jpg
43906 999.mp4 24 999.mp4_frame24.jpg
43907 999.mp4 25 999.mp4_frame25.jpg
43908 999.mp4 26 999.mp4_frame26.jpg
43909 999.mp4 27 999.mp4_frame27.jpg

43910 rows × 3 columns

In [ ]:
#pred = (
#  FEN_model.predict_generator(pred_gen(images_test, 64), steps=1000)
#  .argmax(axis=1)
#  .reshape(-1, 8, 8)
#)
In [75]:
test_moves = []

for i in tqdm(range(image_dftest.shape[0]-1)):

    if(image_dftest['ImageID'][i].split('_')[0] == image_dftest['ImageID'][i+1].split('_')[0]):
        path1 = 'data/video/test_frame/'+ image_dftest['ImageID'][i]
        oh1 = FEN_model.predict(process_image(path1)).argmax(axis=1).reshape(-1, 8, 8)[0]
        path2 = 'data/video/test_frame/'+ image_dftest['ImageID'][i+1]
        oh2 = FEN_model.predict(process_image(path2)).argmax(axis=1).reshape(-1, 8, 8)[0]
        detected_move = move_from_2onehots(oh1, oh2)
        test_moves.append(detected_move)
    else:
        test_moves.append('')   # last frame of a video: no next frame to compare with

test_moves.append('')           # the very last frame has no successor either

image_dftest['moves'] = test_moves
100%|██████████| 43909/43909 [2:51:25<00:00,  4.27it/s]
VideoID frame ImageID moves
0 0.mp4 0 0.mp4_frame0.jpg
1 0.mp4 1 0.mp4_frame1.jpg
2 0.mp4 2 0.mp4_frame2.jpg b4f8
3 0.mp4 3 0.mp4_frame3.jpg
4 0.mp4 4 0.mp4_frame4.jpg
... ... ... ... ...
43905 999.mp4 23 999.mp4_frame23.jpg g1g2
43906 999.mp4 24 999.mp4_frame24.jpg
43907 999.mp4 25 999.mp4_frame25.jpg
43908 999.mp4 26 999.mp4_frame26.jpg b8a8
43909 999.mp4 27 999.mp4_frame27.jpg

43910 rows × 4 columns

In [76]:
video_moves = []

output = image_dftest['moves'][0]
for i in tqdm(range(image_dftest.shape[0]-1)):
    if(image_dftest['VideoID'][i+1] == image_dftest['VideoID'][i]):
      if(image_dftest['moves'][i+1] != ''):
        output += ' '
        output += image_dftest['moves'][i+1]
    else:
      video_moves.append(output)            # a new video starts: store the finished one
      output = image_dftest['moves'][i+1]
video_moves.append(output)                  # store the last video

  9%|▉         | 4143/43909 [00:00<00:00, 41429.15it/s]

100%|██████████| 43909/43909 [00:01<00:00, 43721.84it/s]
In [77]:
video_dftest['label'] = video_moves
VideoID label
0 0.mp4 b4f8 a6a8 b1c3 e8d7 a3a1 b8c6
1 1.mp4 b3b2 d1c3 b2d2 b6b3 a2b2 f3e2 a3b5 e2f3 b2b3
2 2.mp4 g7f7 c3f3 h5h4 f3f2 f8d6 f1g1 f7f2 d4d5 f2f5
3 3.mp4 e5e6 b4a3 a7h7 b5b3 h2g1 b3b1 g1f2 b1g1 h7d7
4 4.mp4 e5e4 b3b2 e4d4 c1d2 c6d8 f4e6 d8e6 a2a4 e6d8
... ... ...
1995 1995.mp4 d8g8 c1b1 b2c2 b6d7 g7e5 g3f3
1996 1996.mp4 h7h5 a1c1 a8a6 d2d3 f7f5
1997 1997.mp4 c3c5 g5h7 c5b5 d3d4 b5c5 d2d1 c5c4 d1d2 c4c1
1998 1998.mp4 e3e6 c8e7 b1c3 f7f5 c5b6
1999 1999.mp4 c6h1 a1d1 d3f2 f1f2 a7a6 e1g1 c8d8 g1g2 b8a8

2000 rows × 2 columns
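One thing to watch: the frame table is sorted with VideoID as text, so '10.mp4' comes before '2.mp4', and the list of moves may not be in the same order as the rows of the submission file. As a hedged alternative to the positional loop, the sketch below groups the moves by video and then looks each video up by name. It uses only the standard library, and the function names `moves_per_video` and `labels_in_submission_order` are made up for this example.

```python
from collections import defaultdict


def moves_per_video(frame_rows):
    """Join the detected moves of each video, ordered by frame number.

    frame_rows: iterable of (video_id, frame_number, move); move may be ''.
    """
    grouped = defaultdict(list)
    for video_id, frame_number, move in frame_rows:
        if move:  # skip frames where no move was detected
            grouped[video_id].append((frame_number, move))
    return {video_id: " ".join(move for _, move in sorted(pairs))
            for video_id, pairs in grouped.items()}


def labels_in_submission_order(video_ids, moves_by_video):
    """Look up each video by name; videos without any move get ''."""
    return [moves_by_video.get(video_id, "") for video_id in video_ids]


rows = [
    ("10.mp4", 0, "a2a3"),
    ("2.mp4", 1, "e7e5"),
    ("2.mp4", 0, "e2e4"),
    ("2.mp4", 2, ""),
    ("10.mp4", 3, "b7b6"),
]
by_video = moves_per_video(rows)
print(labels_in_submission_order(["2.mp4", "10.mp4", "3.mp4"], by_video))
```

In this notebook the rows could come from `zip(image_dftest['VideoID'], image_dftest['frame'], image_dftest['moves'])` and the video names from `video_dftest['VideoID']`.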

Submission ✉

In [78]:
submission = pd.read_csv("data/video/sample_submission.csv")
submission['label'] = video_moves
submission.to_csv("data/video/submission.csv", index=False)
In [79]:
%aicrowd submission create -c chess-transcription -f data/video/submission.csv