
This article briefly describes ten of the most widely used image datasets for fundamental computer vision problems such as classification, detection, and segmentation. To accommodate traditional computer vision approaches, and to encourage resource-constrained readers who want to get started with computer vision, the list also includes some smaller image datasets.

Open Image Dataset Resources



ImageNet has been more or less the de facto standard for image classification since the deep learning revolution. It contains more than 14M images across 21,841 synsets. To make such a large dataset practical to download, the organizers provide separate options for raw images, URLs, SIFT features, bounding boxes, and object attributes. As an added advantage, it also has API integration.

Classification SOTA: 3.57% top-5 error (ResNet, 2015). The detection task has 150 images for each of 3k synsets.

Detection SOTA: 73.1 mAP for 85 object categories.
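To make the top-5 error quoted for ImageNet classification concrete, here is a minimal sketch of how such a number could be computed. The function name `top_k_error` and the toy scores are illustrative assumptions, not part of any ImageNet tooling.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes.

    scores: list of per-class score lists, one list per sample.
    labels: list of true class indices, one per sample.
    """
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example with made-up scores: 2 samples, 6 classes.
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 2 has the highest score
    [0.30, 0.25, 0.20, 0.15, 0.09, 0.01],  # true class 5 has the lowest score
]
labels = [2, 5]
print(top_k_error(scores, labels, k=5))  # 0.5: one of the two samples misses the top 5
```

A 3.57% top-5 error would mean that, for roughly 96 out of 100 test images, the correct label appears among the model's five highest-scoring classes.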



PASCAL VOC covers 20 classes with 11.5k images and 27.5k annotated objects, and 7k of its images are labeled for segmentation. The PASCAL VOC object detection challenge closed after a 7-year run, and excerpts from it have been published.

Detection SOTA: 75.9 mAP at IoU = 0.5
Segmentation SOTA: 89.0 mAP
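The "IoU = 0.5" qualifier on the detection score refers to intersection over union between a predicted box and a ground-truth box. The sketch below shows one way to compute it, assuming boxes are given as `(x_min, y_min, x_max, y_max)` tuples. The helper names and the use of `>=` at the boundary are assumptions made for illustration, not the official evaluation code.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_ground_truth(pred, gt, threshold=0.5):
    """Treat a prediction as a match when its IoU reaches the threshold (assumed >=)."""
    return box_iou(pred, gt) >= threshold


pred = (0, 0, 2, 2)
gt = (1, 1, 3, 3)
print(box_iou(pred, gt))               # overlap area 1, union area 7
print(matches_ground_truth(pred, gt))  # False at the 0.5 threshold
```

Under this criterion a detection only counts toward mAP if it overlaps a ground-truth object of the same class closely enough, so a box that merely touches the object is scored as a false positive.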


[Detection][Segmentation][Image Captioning][Keypoint detection]

MS COCO has more than 200k labeled images containing 1.5M object instances across 80 classes, and each image is annotated with 5 captions. It also includes 250k people with keypoint annotations.

Detection SOTA: 0.52 mAP at IoU = 0.5:0.05:0.95
Segmentation SOTA: 0.48 mAP
Keypoint SOTA: 0.76 mAP
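The range notation on the MS COCO detection score means that average precision is computed at IoU thresholds from 0.5 to 0.95 in steps of 0.05, and the results are then averaged. A minimal sketch of that averaging step follows. The function names and the per-threshold AP values are hypothetical, and the per-threshold AP computation itself is assumed to happen elsewhere.

```python
def coco_iou_thresholds(start=0.5, step=0.05, stop=0.95):
    """Return the list of IoU thresholds start, start+step, ..., stop."""
    count = int(round((stop - start) / step)) + 1
    # Round to 2 decimals to avoid floating-point drift such as 0.6000000000000001.
    return [round(start + i * step, 2) for i in range(count)]


def mean_over_thresholds(ap_by_threshold):
    """Average the AP values obtained at each IoU threshold."""
    return sum(ap_by_threshold.values()) / len(ap_by_threshold)


thresholds = coco_iou_thresholds()
print(thresholds)  # ten thresholds: 0.5, 0.55, ..., 0.95

# Hypothetical AP values that fall as the IoU requirement gets stricter.
ap_by_threshold = {t: 0.8 - 0.07 * i for i, t in enumerate(thresholds)}
print(mean_over_thresholds(ap_by_threshold))
```

Because the stricter thresholds pull the average down, a COCO-style score is typically much lower than a PASCAL VOC score at IoU = 0.5 for the same detector, so the two are not directly comparable.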


[Action Recognition]
Sports-1M contains 1M sports videos with an average length of 5.5 minutes, labeled for 487 sports classes.
SOTA: 73.3%


[Action Recognition]
YouTube-8M is a curated set of 8M YouTube videos that are each between 2 and 10 minutes long and have at least 1,000 views. It has been labeled for 4,800 entities. The average video length is about 4 minutes.
SOTA: 0.839 GAP (Global Average Precision)
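Global Average Precision differs from per-class mAP in that the predictions from all videos are pooled into one ranked list before average precision is computed. The sketch below illustrates that idea under the assumption that the pooled list of `(confidence, is_correct)` pairs has already been built. The function name is hypothetical, and any per-video cutoff on the number of predictions is left out.

```python
def global_average_precision(predictions, num_positives=None):
    """Average precision over one pooled, confidence-ranked list of predictions.

    predictions: list of (confidence, is_correct) pairs pooled across all videos.
    num_positives: total number of ground-truth labels; defaults to the number
        of correct predictions in the list.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    if num_positives is None:
        num_positives = sum(1 for _, correct in ranked if correct)
    if num_positives == 0:
        return 0.0

    hits = 0
    gap = 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            # Precision at this rank times the recall gained by this hit.
            gap += (hits / rank) * (1.0 / num_positives)
    return gap


# Made-up pooled predictions from several videos.
pooled = [(0.9, True), (0.7, True), (0.8, False)]
print(global_average_precision(pooled))
```

Pooling means a confident wrong prediction on one video lowers the precision credited to every correct prediction ranked below it, regardless of which video those came from.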

Those are the big shots. Are you resource constrained but still interested in getting started with deep learning? You could use the following smaller image datasets for tasks such as classification.


CIFAR-10 consists of 60k small (32×32) images classified into 10 classes. It can be used to try out SIFT-based approaches or to build a custom CNN of your own.
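Before feeding CIFAR-10 images to a feature extractor or a CNN, each stored row usually has to be reshaped into an image. The sketch below assumes the layout of the commonly distributed Python version, where each image is a flat row of 3,072 values: all 1,024 red values first, then green, then blue, each in row-major order. The function name `row_to_image` is hypothetical, and the layout should be checked against the dataset's own documentation.

```python
def row_to_image(row, height=32, width=32, channels=3):
    """Convert a flat channel-major row into image[y][x] = (r, g, b).

    Assumes the row stores one full colour plane after another
    (all red, then all green, then all blue), each plane row-major.
    """
    plane = height * width
    if len(row) != plane * channels:
        raise ValueError(f"expected {plane * channels} values, got {len(row)}")
    return [
        [
            tuple(row[c * plane + y * width + x] for c in range(channels))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Tiny 2x2 stand-in for a 32x32 image: four red, four green, four blue values.
tiny_row = [10, 11, 12, 13, 20, 21, 22, 23, 30, 31, 32, 33]
tiny_image = row_to_image(tiny_row, height=2, width=2)
print(tiny_image[0][0])  # (10, 20, 30)
print(tiny_image[1][1])  # (13, 23, 33)
```

With NumPy available, the same conversion is usually written as a reshape to (3, 32, 32) followed by a transpose to (32, 32, 3).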



CIFAR-100 is an image dataset for fine-grained classification. It contains 100 classes grouped into superclasses, and each class contains 500 training images and 100 test images.

CALTECH datasets


CALTECH-101 contains 101 classes with 40 to 800 images per class, each about 300×200 pixels, compiled for classification. CALTECH-256, a scaled-up extension of its predecessor, contains 256 classes encompassing 30,607 images.