A guide to building AI apps and artefacts

Chapter 3 - Adding image recognition to programs

Ken Kahn, University of Oxford

You can export the blocks presented here as a project or download as a library to import into your projects.

Browser compatibility

This chapter of the guide includes many interactive elements that currently only run well in the Chrome browser. This chapter relies upon there being a camera that the browser can access. There is a troubleshooting guide that should be consulted if problems are encountered.


A camera connected to a computer can report the colour of each pixel in an image and not much else. A description of what is in front of the camera is returned when those pixels are sent to an image recognition service. There are many kinds of things an image description may contain. With speech recognition the description is what was spoken, how confident the system is, and possible alternatives. With image recognition there are many possible descriptions: descriptive tags, possible captions, dominant colours, location of faces and parts of the face if there are any, and the presence of landmarks, celebrities, well-known entities, and logos. Hand-written and scanned text can also be recognised.

A challenge in providing student-friendly programming blocks for image recognition is that different AI cloud services report different descriptions in different ways. We currently provide interfaces to image recognition services provided by Google and Microsoft. A further challenge is how to provide simple interfaces for simple tasks while still supporting more sophisticated uses and projects.

In the last few years there has been tremendous progress in computer vision. There are high performance systems for identifying objects, recognising faces, interpreting sketches, and using medical images to aid diagnosis. Driverless cars rely heavily upon computer vision.

API keys are needed

Chrome has built-in support for speech synthesis and recognition. No browser, however, currently supports image recognition. To access vision recognition services from companies such as Google or Microsoft one needs to open an account. Accounts are free and provide some degree of free usage. Microsoft permits 5000 queries per month, and Google 1000 per month. To try the vision blocks described here you need at least one account. Comparing and contrasting the results from different services is an interesting way to gain some insight into how these services work.

An alternative to using AI cloud services for image classification is to load deep learning vision models into the browser. Some of the blocks in the next chapter can be used to analyse images. Some can be further trained by users, which is especially useful when one wants to distinguish between images for idiosyncratic reasons (e.g. different facial expressions). One has been trained to recognise one thousand different kinds of things (but not people). While the models that run in the browser are not as capable or accurate as AI cloud services, they have the advantage that no data leaves your device, maintaining privacy. Furthermore, they work even without a network connection.

In projects there is a different way of providing keys to Snap! (as described here) but in this chapter you can paste your key or keys below and they'll be passed to the services as you use example blocks.

A simple image recognition block

This block takes a picture, sends it to the AI vision cloud server provider, waits for a response, and then reports a list of labels of the photo. The list is ordered by how confident the vision provider is that the label matches the image.
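
The ordering step can be sketched in a few lines of Python. This is only an illustration: it assumes the service's reply has already been reduced to (label, confidence) pairs, and the function name and sample data are made up rather than taken from Google's or Microsoft's actual response formats.

```python
# Hypothetical (label, confidence) pairs; real services use their own formats.
def labels_by_confidence(scored_labels):
    """Return the label names ordered from most to least confident."""
    ranked = sorted(scored_labels, key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked]


example = [("furniture", 0.71), ("cat", 0.98), ("indoor", 0.85)]
print(labels_by_confidence(example))  # ['cat', 'indoor', 'furniture']
```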

Displaying the image that was sent for recognition

The show current photo block will display the most recent image sent to the specified AI vision services as the background of the Snap! stage. To try it first obtain some labels from a vision service. You can also use the use camera to create costume block to take a new picture and add it as a costume to the current sprite.

A sample program using image recognition

Here is a program that can contact either Google or Microsoft and display the tags returned.

A sample program combining image and speech recognition and speech output

This program is similar to the previous one except it is listening for the words "Google" or "Microsoft". When it hears one of these words it contacts that service. When the response is received it turns it into speech. As a demo it is impressive if you say things like "Tell me Google what do you see?" and "What do you see Microsoft?". This will work in every language. It is a nice illustration of how software can appear more intelligent than it is. Do you think this program seems intelligent? If so, why? Is it intelligent?

Advanced image recognition blocks

Image recognisers can do more than label images. Some can detect and locate faces. Some of those will estimate the age of the person and their gender. Some recognise landmarks and logos. Many can recognise characters in hand-written or scanned text. A problem in creating Snap! blocks that provide this functionality is that different services have different capabilities and structure their responses differently.

Getting properties of an image

The Recognize new photo block addresses this problem by reporting a data structure capturing the entire response from the vision service. It is a Snap! list that contains lists that contain more data: text for labels, numbers for confidence scores, and even more lists for complex data. You can double-click on the list icons inside a list to drill down into the structure.

The Current image property block takes an argument that describes what piece of the response structure should be reported. For example, Microsoft offers possible captions that can be found by following the path "description captions text". Common useful blocks are defined that use the Current image property block internally to get the labels and the confidence scores of the labels for each of the supported vision service providers. Note that after a response is received from any of the AI services it is stored so that calls to Current image property use the most recent response rather than ask for a new one. This is because a project may need to access multiple pieces of a response.
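
The idea of following a path through a stored response can be sketched in Python. This is a rough illustration, not the blocks' actual implementation: the sample response is invented (only loosely shaped like a reply that carries captions), and the rule that a path step applies to every element of a list is an assumption.

```python
# An invented response; real replies from the services are larger and differ in shape.
last_response = {
    "description": {
        "captions": [{"text": "a cat sitting on a sofa", "confidence": 0.93}],
        "tags": ["cat", "sofa", "indoor"],
    }
}


def image_property(data, path):
    """Follow a path given as space-separated keys or as a list of keys."""
    keys = path.split() if isinstance(path, str) else list(path)
    if not keys:
        return data
    if isinstance(data, list):
        # Assumption: apply the remaining path to every element of a list.
        return [image_property(item, keys) for item in data]
    return image_property(data[keys[0]], keys[1:])


# Both calls read from the same stored response; no new request is made.
print(image_property(last_response, "description captions text"))
print(image_property(last_response, ["description", "tags"]))
```

Because the response is kept in a variable, several properties can be read from one reply, which mirrors the reason given above for storing the most recent response.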

After running the Recognize new photo block, you should select the same AI cloud provider (currently Google or Microsoft) when you run the Current image property block. The second argument can be a string or a list of strings specifying what information is desired from the image recognition. Each AI cloud provider supports different image properties.

Common calls to the get property reporter are pre-defined. They are Microsoft labels, Microsoft first caption, Google labels, and Google label scores.

Exercise. Recognise a picture and then use the Current image property block to see what kinds of descriptions the service provides. How might each one be useful?

Advanced Exercise. Compare the documentation for the vision services from Google and Microsoft.

The Recognize new photo reporter block is implemented using the Ask <provider> to say what it sees block. This block does not wait for a response; instead it runs the user's blocks when a response is received. It isn't as convenient as Recognize new photo but it can support more complex usage.
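
The relationship between the two styles can be sketched in Python. This is an analogy under stated assumptions, not how the Snap! blocks are written: ask_provider and recognize are hypothetical names, and the network request is replaced by a thread that reports fixed labels.

```python
import threading


def ask_provider(image, on_response):
    """Hypothetical non-blocking call: returns at once, calls on_response later."""
    def work():
        # Stand-in for the network request to a vision service.
        on_response(["cat", "sofa"])
    threading.Thread(target=work).start()


def recognize(image):
    """Blocking wrapper built on top of the callback version."""
    done = threading.Event()
    result = {}

    def on_response(labels):
        result["labels"] = labels
        done.set()

    ask_provider(image, on_response)
    done.wait()  # pause here until the callback has run
    return result["labels"]


print(recognize("photo"))  # ['cat', 'sofa']
```

The callback version lets a program carry on with other work while waiting; the wrapper trades that flexibility for a simpler call that just reports a value.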

How to provide API keys

This chapter uses the API keys provided in the text areas above. When constructing a project there are two ways to provide keys:

  1. Add extra information to the page's URL. Appending any (or all) of &Google image key=... or &Microsoft image key=... to a shared project URL will provide the keys (after "..." is replaced by the real keys). You can provide only one key if you aren't interested in comparing the responses from different AI service providers. Refresh the page after adding the keys.
  2. Edit one or both of the global variables Google vision key or Microsoft vision key. Note these variables are declared as transient so they will not be set if you save and load your project.
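
The first option can be illustrated with a small Python sketch that reads the keys from a URL and produces a copy of the URL without them. The example URL is made up, and the sketch assumes the keys appear in the URL's query string; a real shared project URL may carry them in a different part of the address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

KEY_NAMES = ("Google image key", "Microsoft image key")


def keys_from_url(url):
    """Collect any vision keys found among the URL's parameters."""
    pairs = parse_qsl(urlsplit(url).query)
    return {name: value for name, value in pairs if name in KEY_NAMES}


def url_without_keys(url):
    """Return the same URL with the key parameters removed."""
    parts = urlsplit(url)
    kept = [(n, v) for n, v in parse_qsl(parts.query) if n not in KEY_NAMES]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URL with a placeholder instead of a real key.
url = "https://example.org/snap.html?project=demo&Google+image+key=PLACEHOLDER"
print(keys_from_url(url))
print(url_without_keys(url))
```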

Because all the AI cloud services are commercial, heavy use can be costly. Consequently, it is best if each student has their own account and minimises the sharing of their keys. This conflicts with one of the great things about tools like Scratch and Snap! -- that it is so easy to share one's projects with a wider community. Adding the keys to the URL solves this problem if one is careful to keep the URL with keys private and share the version without the keys.

What is image recognition good for?

This is like asking what vision is good for. According to Wikipedia eyes have independently evolved between 50 and 100 times since animals first appeared. Computer vision enables robots to see and makes possible self-driving cars (which really are just a kind of robot), new kinds of user interfaces, and support for doctors, police, sports coaches, farmers, the military, and more. It can help blind people to navigate their surroundings and to use devices.

What are the dangers of image recognition?

Like much of AI technology it may take away many jobs. It may reduce our privacy since it makes it much easier to track people's movements and activities. It can enable autonomous weapons that may kill many.

Another danger is that either intentionally or unintentionally the data the machine learning system is trained on may contain biases. Here is a nice short video on bias and machine learning published by Google.

How does computer vision work?

Image recognition begins with pixels. Each pixel is a number (for the gray level) or three numbers (for red, green, and blue components) that corresponds to a tiny piece of an image. There are two main approaches to processing images:
  1. A sequence of programmed processing steps to find edges, determine textures, identify objects, etc.
  2. A machine learning system
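
A tiny example of the first approach, written in Python: each colour pixel is turned into a gray level, and an edge is found wherever neighbouring gray levels differ sharply. The gray conversion uses the common luma weights 0.299, 0.587 and 0.114; the function names and the four-pixel row are made up for illustration, and real edge detectors work on whole images rather than a single row.

```python
def to_gray(rgb_pixel):
    """Convert one (red, green, blue) pixel to a single gray level."""
    r, g, b = rgb_pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def edge_strengths(gray_row):
    """Absolute difference between each pair of neighbouring gray levels."""
    return [abs(b - a) for a, b in zip(gray_row, gray_row[1:])]


# A row of two black pixels followed by two white pixels.
row = [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)]
gray = [to_gray(pixel) for pixel in row]
print(gray)                  # [0, 0, 255, 255]
print(edge_strengths(gray))  # [0, 255, 0]
```

The large middle value marks the boundary between the black and white pixels.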

Most of the recent progress has been with machine learning systems, which for many tasks perform near or better than humans. For some tasks, such as detecting lung cancer from x-rays, machine learning has exceeded the ability of experts.

A neural net is a program that is inspired by what little is known about how neurons in animal and human brains work. It can be trained to identify images. In the most common case it is trained on thousands or millions of images that have already been labelled. When given a new image it computes the most likely labels. Some easy visual discrimination tasks can be handled by training on only a hundred or fewer images. You can train a neural net to do this kind of vision recognition in the next chapter.

Different levels or stages of the processing by the neural net recognise different elements of an image. Researchers have built software that figures out what images produce the greatest response from the different layers. A very nice interactive tool for exploring what different pieces of a neural network respond to is the OpenAI Microscope.

Ideal images for an early neural net layer that looks for edges
Ideal images for a later neural net layer that looks for textures
Ideal images for a later neural net layer that looks for patterns
Ideal images for a later neural net layer that looks for parts
Ideal images for a late neural net layer that looks for objects

What about video recognition services?

Some AI cloud providers such as Google can accept a video stream and produce labels. They can also detect scene changes. This service can be expensive, though Google will analyse 1000 minutes of video per month at no cost. There currently are no Snap! blocks for supporting video input.

Possible project ideas using image recognition

Adding image recognition to a robotics project can make the robot behave much more intelligently. It can head towards goals and avoid specified objects. A simple example would be a robot told to move to X (where X can be "the red ball", "the toy truck", "a person", or whatever). It begins by sending the current camera image for recognition. If the description returned matches X then go forward a few steps (assuming the camera is mounted pointing forward). If not, then turn a little and try again. Repeat until X is reached.
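
The loop described above can be sketched in Python with a simulated robot. Everything here is hypothetical: the seek function, the ToyRobot class, and the rule that the description "matches X" when X appears among the returned labels. A step limit is added so the search cannot run forever.

```python
def seek(target, recognise, forward, turn, reached, max_steps=100):
    """Turn until the target is seen, then step towards it until it is reached."""
    for _ in range(max_steps):
        if reached():
            return True
        if target in recognise():
            forward()
        else:
            turn()
    return False


class ToyRobot:
    """Simulated robot: the target is only in view when heading is 2."""

    def __init__(self):
        self.heading = 0
        self.distance = 3

    def recognise(self):
        # Stand-in for sending the camera image to a vision service.
        return ["red ball"] if self.heading == 2 else ["wall"]

    def forward(self):
        self.distance -= 1

    def turn(self):
        self.heading = (self.heading + 1) % 4

    def reached(self):
        return self.distance == 0


robot = ToyRobot()
print(seek("red ball", robot.recognise, robot.forward, robot.turn, robot.reached))
```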

For projects that use a camera that can't move there are many possibilities:

  1. Engage in a simple conversation about what it thinks it is seeing.
  2. Respond to what it sees. E.g. say "How cute!" when the description includes words such as "kitten" or "puppy" but say "How scary!" when the description includes a toy lion or wolf.
  3. Enhance the story generator example so that in addition to tokens that ask for phrases or names, a new kind of token is introduced that causes the current image to be recognised and then substitutes the resulting description into that location of the story.
  4. When a face is recognised its location and the location of parts of the face are included in the response. The app can then know where to place glasses, moustaches, etc. on the image.
  5. And thousands more possibilities.
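
As one illustration, the second idea (responding to what is seen) might look like this in Python. The word lists and the reaction function are made up for this sketch, and it assumes the description arrives as a list of label strings.

```python
CUTE = {"kitten", "puppy"}
SCARY = {"lion", "wolf"}


def reaction(labels):
    """Pick a spoken reaction based on words found in the labels."""
    words = {word for label in labels for word in label.lower().split()}
    if words & CUTE:
        return "How cute!"
    if words & SCARY:
        return "How scary!"
    return None  # nothing to say about this picture


print(reaction(["Kitten", "sofa"]))  # How cute!
print(reaction(["toy lion"]))        # How scary!
```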

Additional resources

Google and Microsoft document their AI vision services. There are many more vision services; see for example this image recognition services comparison page. The Wikipedia page on computer vision is very thorough. There are many MOOCs on computer vision and machine learning for image recognition. Distill is a scientific journal that strives to explain complex machine learning topics clearly. A Distill article is an in-depth description of how a neural net does image recognition. A general audience summary of the article was written by the New York Times. Codecademy has a good interview with a data scientist about machine learning.

Where to get these blocks to use in your projects

You can export the blocks presented here as a project or download as a library to import into your projects.

Learn about machine learning

The next chapter is about machine learning.

Return to the previous chapter on speech recognition.