Zero-Shot Learning: “The Seen, The Unseen and The Unknown”

A thought framework for Zero-Shot Learning

Aug 30, 2021

In this story, we try to build a simple framework to think about the seen classes, the unseen classes and the unknown classes while dealing with Zero-Shot Learning problems.

On a weekend, pre-covid, you go to a birdwatching park. You are a novice bird watcher and hence you ask a friend to help you recognize the birds that you might encounter.

The Seen

You know just one type of bird, one you saw around your house throughout your childhood: the House Sparrow.

House Sparrow is ‘The Seen’ bird type.

Here is how you recognize this bird:

You capture its image when you see it.
Process it through the vision model of your brain.
The model will then generate a semantic vector after processing the image.
Then you match this generated semantic vector against your stored semantic vector of a House Sparrow, which was learned from the many images of the bird you saw throughout your childhood.
If the similarity of these vectors is higher than a threshold, you are confident that the bird you are looking at is a House Sparrow.
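
The steps above can be sketched in a few lines of Python. This is only a toy illustration: the 3-dimensional vectors, the 0.9 threshold and the function names are assumptions made for this example, and a real vision model would produce vectors with far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_seen_class(image_vector, class_vector, threshold=0.9):
    """True when the image vector is similar enough to the stored class vector."""
    return cosine_similarity(image_vector, class_vector) >= threshold

# Hypothetical vectors, chosen by hand for illustration only.
house_sparrow = [0.9, 0.1, 0.2]     # stored vector, learned from many sightings
new_sighting = [0.85, 0.15, 0.25]   # vector for the bird in front of you
other_bird = [0.1, 0.9, 0.3]        # vector for a very different bird

print(is_seen_class(new_sighting, house_sparrow))  # True
print(is_seen_class(other_bird, house_sparrow))    # False
```

The threshold is a design choice: set it too low and every small brown bird becomes a House Sparrow, set it too high and you fail to recognize the real one from an unusual angle.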

The Unseen

Your friend knows how little you know about birds, so they give you a description of one bird to help you out during your birdwatching picnic.

It looks like a sparrow but has a yellow beak. It looks like it is wearing a white crown. It’s a White-crowned sparrow.

On your picnic, you see a lot of birds, a few House Sparrows among them, but you also see a couple that look like sparrows yet are not House Sparrows (their similarity is lower than your threshold).

Do you recognise any of these? You recognise the white-crowned sparrow instantly!

White-crowned sparrow, Brewer sparrow, Black-throated sparrow

White-crowned Sparrow is ‘The Unseen’ bird type.

Here is how you recognise the bird:

You pass the text description of the bird through the multi-modal model in your brain and generate a semantic vector of the description.
You take in the new image of the bird, pass it through the same multi-modal model and generate a semantic vector of the image.
You will calculate the similarity of these semantic vectors.
If the similarity of these vectors is higher than a threshold, you are confident that the bird you are watching is a White-crowned Sparrow.
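
As a rough sketch of these steps, the example below places a text description and an image in the same space by scoring both against a small list of attributes. Everything here is an assumption made for illustration: the attribute list, the keyword-matching text encoder, the hand-written image scores and the 0.9 threshold. A real multi-modal model learns its encoders from data.

```python
import math

# Hypothetical attribute vocabulary that defines the shared semantic space.
ATTRIBUTES = ["sparrow", "yellow beak", "white crown", "red head"]

def encode_description(text):
    """Toy text encoder: 1.0 for every attribute phrase mentioned in the text."""
    lowered = text.lower()
    return [1.0 if phrase in lowered else 0.0 for phrase in ATTRIBUTES]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

description = ("It looks like a sparrow but has a yellow beak. "
               "It looks like it is wearing a white crown.")
class_vector = encode_description(description)  # built from text only, no images

# Hypothetical attribute scores an image encoder might output for two sightings.
white_crowned_image = [0.9, 0.8, 0.9, 0.0]
house_sparrow_image = [0.95, 0.1, 0.0, 0.0]

THRESHOLD = 0.9  # assumed value
print(cosine_similarity(white_crowned_image, class_vector) >= THRESHOLD)  # True
print(cosine_similarity(house_sparrow_image, class_vector) >= THRESHOLD)  # False
```

The point of the sketch is that the class vector never touches an image of a White-crowned Sparrow: it comes from the description alone, which is what makes the class unseen.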

The Unknown

After impressively recognizing 2 types of sparrows, you are thrilled and can’t wait to go back and gloat about your birdwatching story to your expert friend. But as you are about to leave, you witness a very peculiar type of bird. You have never seen such a bird before. It’s not just different from the house sparrow or the white-crowned sparrow, but it does not resemble a sparrow in any way.

So you go back to your friend and describe the bird that you had seen. In no way can you associate this bird with a name, but you sure can describe it.

Red-headed Woodpecker is ‘The Unknown’ bird type.

Here is how you describe the bird:

You look at the bird and pass its image through the multi-modal model in your brain, which generates a semantic vector.
Using this semantic vector, you put what you saw into words.
You say, “It is a tall bird with a long, pointy beak, a red-coloured head, black eyes, a white belly and wings that are half black, half white.”

On hearing this description, your expert friend can recognize the bird as a Red-headed Woodpecker with good certainty.

Comparing with ZSL-models

Of course, the other two sparrows (Brewer sparrow, Black-throated sparrow) are also of The Unknown type. In their case, though, a nearest-neighbour search over the semantic vectors you already hold returns the house sparrow and the white-crowned sparrow as the closest matches, because all of them belong to the same domain: sparrow. Hence you can describe them as ‘sparrow’.

But in the case of the red-headed woodpecker, although it is still a bird, you can’t use the word ‘sparrow’ to describe it, because its vector lies much farther from the existing vectors.
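
The contrast between the two unknown cases can be sketched as a nearest-neighbour lookup with a distance cut-off. The vectors, the 0.5 cut-off and the function names below are hypothetical values picked for this example, not taken from any model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbours(query, known_classes, k=2):
    """Return the k closest known classes as (name, distance) pairs, nearest first."""
    distances = [(name, euclidean_distance(query, vector))
                 for name, vector in known_classes.items()]
    return sorted(distances, key=lambda pair: pair[1])[:k]

def shares_domain(query, known_classes, max_distance=0.5):
    """True when the nearest known class lies within max_distance of the query."""
    _, distance = nearest_neighbours(query, known_classes, k=1)[0]
    return distance <= max_distance

# Hypothetical vectors: [sparrow-like shape, white crown, red head]
known_classes = {
    "house sparrow": [0.9, 0.1, 0.0],
    "white-crowned sparrow": [0.9, 0.9, 0.0],
}
brewer_sparrow = [0.85, 0.2, 0.0]
red_headed_woodpecker = [0.1, 0.3, 0.9]

print(nearest_neighbours(brewer_sparrow, known_classes)[0][0])  # house sparrow
print(shares_domain(brewer_sparrow, known_classes))             # True
print(shares_domain(red_headed_woodpecker, known_classes))      # False
```

When the nearest neighbour is close, you can borrow its domain label (‘sparrow’); when even the nearest neighbour is far away, the honest answer is a description rather than a name.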

We take for granted how well our multi-modal brain model works in these mundane, day-to-day scenarios, seamlessly converting text and images into a common semantic space. Just as in this story, a ZSL model uses the same method to handle seen, unseen and unknown classes.

Classifying seen and unseen classes with the textual explanations for the prediction

Just like us, the model won’t classify unseen classes perfectly every time, but it tends to assign an unseen bird to a class that is semantically quite similar to the actual class.

[Figure: unseen-class images shown with their ground-truth class and predicted class]

And in the case of an unknown class, it too will run a nearest-neighbour similarity search to find semantically similar classes.

The seen, the unseen and the unknown classes are differentiated by the type of data that we have.

One-shot and Few-shot Classification

If we have images of the class in our data then it is a seen class.
In the case of seen class classification, the problem boils down to a normal classification problem.
If the number of images for the seen class is one or a few then it becomes a one-shot or few-shot classification problem.

Face recognition on your smartphone is a familiar example: it learns to identify your face from just a few images of it.
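
One common way to work with only one or a few images per class is to average their vectors into a single class prototype and assign a new image to the nearest prototype. The sketch below illustrates that idea; it is not the method of any particular paper or phone, and the class examples and vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(query, prototypes):
    """Return the class whose prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine_similarity(query, prototypes[name]))

# Hypothetical image vectors: three examples of one class, one example of another.
support_set = {
    "house sparrow": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [1.0, 0.1, 0.1]],  # few-shot
    "brewer sparrow": [[0.7, 0.9, 0.1]],                                   # one-shot
}
prototypes = {name: mean_vector(examples) for name, examples in support_set.items()}

print(classify([0.85, 0.15, 0.15], prototypes))  # house sparrow
print(classify([0.7, 0.8, 0.1], prototypes))     # brewer sparrow
```

With a single example the prototype is simply that example's vector, which is why one-shot classification is so sensitive to the one image you happen to have.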

Zero-shot Classification

If we have only a textual/semantic description of a class and no images of it, then it is an unseen class.
Classification on unseen classes only is called zero-shot (ZS) classification.
Classification on both seen and unseen classes is called generalized zero-shot (GZS) classification.
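
In code, the difference between the two settings comes down to which labels the classifier is allowed to choose from. The sketch below assumes hand-written class vectors and a hypothetical split into seen and unseen classes; only the candidate list changes between ZS and GZS.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(image_vector, class_vectors, candidates):
    """Pick the most similar class among the allowed candidate labels."""
    return max(candidates,
               key=lambda name: cosine_similarity(image_vector, class_vectors[name]))

# Hypothetical class vectors: [sparrow-like, white crown, black throat]
class_vectors = {
    "house sparrow": [1.0, 0.0, 0.0],
    "white-crowned sparrow": [1.0, 1.0, 0.0],
    "black-throated sparrow": [1.0, 0.0, 1.0],
}
seen = ["house sparrow"]                                       # images available
unseen = ["white-crowned sparrow", "black-throated sparrow"]  # descriptions only

white_crowned_image = [0.9, 0.8, 0.1]
house_sparrow_image = [0.9, 0.1, 0.05]

# ZS: test images come from unseen classes, and only unseen labels are candidates.
print(predict(white_crowned_image, class_vectors, unseen))         # white-crowned sparrow
# GZS: test images may come from either group, so every label is a candidate.
print(predict(white_crowned_image, class_vectors, seen + unseen))  # white-crowned sparrow
print(predict(house_sparrow_image, class_vectors, seen + unseen))  # house sparrow
```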

One classic example for understanding zero-shot classification is: describing a zebra as a horse with black and white stripes to a person who has never seen a zebra.

When you have neither the text nor the image of a class, it is referred to as an unknown class. The way to extract information about such a class is to look up its nearest neighbours in the semantic vector space and so find similar classes, as we did earlier.

References

The images of the birds in this blog are taken from the CUB-200 dataset.
The images from the ‘comparing with ZSL-models’ section have been taken from the paper: A Deep Multi-Modal Explanation Model for Zero-Shot Learning
I have a video explanation of this paper and many more on my YouTube channel.

About Author

I am a senior CS undergraduate with a keen interest in Deep Learning. I write blogs and make explanation videos on my YouTube channel, where I explain niche and trending deep learning papers every week. I am also open to research collaborations; you can find my details on my website.