This chapter of the guide includes many interactive elements that currently only run well in the Chrome browser. This chapter relies upon there being a camera that the browser can access. There is a troubleshooting guide that should be consulted if problems are encountered.
A camera connected to a computer can report the colour of each pixel in an image and not much else. A description of what is in front of the camera is returned when those pixels are sent to an image recognition service. There are many kinds of things an image description may contain. With speech recognition the description is what was spoken, how confident the system is, and possible alternatives. With image recognition there are many possible descriptions: descriptive tags, possible captions, dominant colours, location of faces and parts of the face if there are any, and the presence of landmarks, celebrities, well-known entities, and logos. Hand-written and scanned text can also be recognised.
A challenge in providing student-friendly programming blocks for image recognition is that different AI cloud services report different descriptions in different ways. We currently provide interfaces to image recognition services provided by Google, Microsoft, and IBM. A further challenge is how to provide simple interfaces for simple tasks while still supporting more sophisticated uses and projects.
In the last few years there has been tremendous progress in computer vision. There are high-performance systems for identifying objects, recognising faces, interpreting sketches, and using medical images to aid diagnosis. Driverless cars rely heavily upon computer vision.
Chrome has built-in support for speech synthesis and recognition. No browser, however, currently supports image recognition. To access vision recognition services from companies such as Google, Microsoft, or IBM one needs to open an account. Accounts are free and provide some degree of free usage. IBM permits 250 free vision queries per day, Microsoft 5000 per month, and Google 1000 per month. To try the vision blocks described here you need at least one account. Comparing and contrasting the results from different services is an interesting way to gain some insight into how these services work.
An alternative to using AI cloud services for image classification is to load deep learning vision models into the browser. Some of the blocks in the next chapter can be used to analyse images. Some can be further trained by users, which is especially useful when one wants to distinguish between images for idiosyncratic reasons (e.g. different facial expressions). One has been trained to recognise one thousand different kinds of things (but not people). While the models that run in the browser are not as capable or accurate as AI cloud services, they have the advantage that no data leaves your device, maintaining privacy. Furthermore, they work even without a network connection.
In projects there is a different way of providing keys to Snap! (as described here), but in this chapter you can paste your key or keys below and they'll be passed to the services as you use example blocks.
This block takes a picture, sends it to the AI vision cloud service provider, waits for a response, and then reports a list of labels of the photo. The list is ordered by how confident the vision provider is that the label matches the image.
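For readers who also program in text languages, the ordering by confidence described above can be sketched in Python. This is only an illustration: the label and score pairs are made up, and the name labels_by_confidence is an assumption rather than anything the Snap! blocks or the vision services define.

```python
# Hypothetical (label, confidence) pairs standing in for a vision service reply.
raw_labels = [("table", 0.71), ("cat", 0.97), ("indoor", 0.88)]


def labels_by_confidence(pairs):
    """Return just the label names, most confident first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [label for label, _score in ranked]


print(labels_by_confidence(raw_labels))  # → ['cat', 'indoor', 'table']
```

Keeping the scores out of the reported list mirrors the simple case of the block; a project that needs the numbers can ask for them separately.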
The show current photo block will display the most recent image sent to the specified AI vision service as the background of the Snap! stage. To try it, first obtain some labels from a vision service. You can also use the use camera to create costume block to take a new picture and add it as a costume to the current sprite.
Here is a program that can contact any of Google, Microsoft, or IBM and display the tags returned.
This program is similar to the previous one except it is listening for the words "Google", "Microsoft", or "Watson". When it hears one of these words it contacts that service. When the response is received it turns it into speech. As a demo it is impressive if you say things like "Tell me Google what do you see?" and "What do you see Microsoft?". This will work in every language. It is a nice illustration of how software can appear more intelligent than it is. Do you think this program seems intelligent? If so, why? Is it intelligent?
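The keyword spotting behind this demo is simple enough to sketch in a few lines of Python. This is not how the Snap! program is written; it is a hedged illustration, and the function name provider_in is an assumption. It shows why the program can seem to understand a whole sentence while only looking for one word.

```python
PROVIDERS = ("Google", "Microsoft", "Watson")


def provider_in(spoken_text):
    """Return the provider named in the phrase (checked in the order above), or None."""
    # Lower-case each word and drop trailing punctuation so "Microsoft?" still matches.
    words = [word.strip(".,?!").lower() for word in spoken_text.split()]
    for provider in PROVIDERS:
        if provider.lower() in words:
            return provider
    return None


print(provider_in("Tell me Google what do you see?"))  # → Google
print(provider_in("What do you see Microsoft?"))       # → Microsoft
```

Everything else in the sentence is ignored, which is worth keeping in mind when answering the questions above.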
Image recognisers can do more than label images. Some can detect and locate faces. Some of those will estimate the age of the person and their gender. Some recognise landmarks and logos. Many can recognise characters in hand-written or scanned text. A problem in creating Snap! blocks that provide this functionality is that different services have different capabilities and structure their responses differently.
The Recognize new photo block addresses this problem by reporting a data structure capturing the entire response from the vision service. It is a Snap! list that contains lists that contain more data, which might be text for labels, numbers for confidence scores, or even more lists for complex data. You can double click on the icons to drill down the structure.
The Current image property block takes an argument that describes what piece of the response structure should be reported. For example, Microsoft offers possible captions that can be found by following the path "description captions text". Common useful blocks are defined that use the Current image property block internally to get the labels and the confidence scores of the labels for each of the supported vision service providers. Note that after a response is received from any of the AI services it is stored so that calls to Current image property use the most recent response rather than ask for a new one. This is because a project may need to access multiple pieces of a response.
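The idea of following a path through a nested response can be sketched in Python. This is an illustration under assumptions, not the implementation of the Current image property block: the sample response below is made up and only loosely shaped like a Microsoft-style reply, Python dictionaries stand in for Snap! lists, and the name follow_path is hypothetical.

```python
def follow_path(data, path):
    """Follow a space-separated path of keys through nested dicts and lists."""
    current = data
    for key in path.split():
        if isinstance(current, list):
            # A list has no named fields, so look the key up in every element.
            current = [item[key] for item in current]
        else:
            current = current[key]
    return current


# Made-up response; a real service returns many more fields.
sample_response = {
    "description": {
        "tags": ["cat", "indoor"],
        "captions": [{"text": "a cat sitting on a table", "confidence": 0.93}],
    }
}

print(follow_path(sample_response, "description captions text"))
# → ['a cat sitting on a table']
```

Because the response is kept after it arrives, several different paths can be followed through the same structure without contacting the service again.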
After running the Recognize new photo block you should select the same AI cloud provider (currently Google, Microsoft, or IBM Watson) when you run the Current image property block. The second argument can be a string or a list of strings specifying what information is desired from the image recognition. Each AI cloud provider supports different image properties:
Common calls to the get property reporter are pre-defined. They are IBM Watson classes, IBM Watson scores, Microsoft labels, Microsoft first caption, Google labels, and Google label scores.
Exercise. Recognise a picture and then use the Current image property block to see what kinds of descriptions the service provides. How might each one be useful?
Advanced Exercise. Compare the documentation for the vision services from Google, Microsoft, and IBM Watson.
The Recognize new photo reporter block is implemented using the Ask <provider> to say what it sees block. This block does not wait for a response; instead it runs the user's blocks when a response is received. It isn't as convenient as 'Recognize new photo' but it can support more complex usage.
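The relationship between the two blocks, a waiting reporter built on top of a non-waiting one, can be sketched in Python. This is a sketch under assumptions rather than the actual Snap! implementation: ask_provider is a hypothetical stand-in that replies with made-up labels after a short delay, and recognise_new_photo is a hypothetical wrapper that waits for that reply.

```python
import threading


def ask_provider(image, callback):
    """Hypothetical non-waiting request: the reply arrives later on another thread."""
    def reply():
        callback(["cat", "indoor"])  # made-up labels standing in for a real response
    threading.Timer(0.05, reply).start()


def recognise_new_photo(image):
    """Waiting wrapper: start the request, then pause until the callback has run."""
    done = threading.Event()
    result = {}

    def on_response(labels):
        result["labels"] = labels
        done.set()

    ask_provider(image, on_response)
    done.wait(timeout=5)  # give up after five seconds rather than wait forever
    return result.get("labels")


print(recognise_new_photo("photo"))  # → ['cat', 'indoor']
```

The callback form is less convenient but lets a program carry on with other work, such as animating a sprite, while the service is thinking.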
This chapter uses the API keys provided in the text areas above. When constructing a project there are two ways to provide keys:
Because all the AI cloud services are commercial services heavy use can be costly. Consequently it is best if each student has their own account and minimises the sharing of their keys. This conflicts with one of the great things about tools like Scratch and Snap! -- that it is so easy to share one's projects with a wider community. Adding the keys to the URL solves this problem if one is careful to keep the URL with keys private and share the version without the keys. It is possible but more awkward to maintain private and public versions when using reporters for keys.
This is like asking what vision is good for. According to Wikipedia eyes have independently evolved between 50 and 100 times since animals first appeared. Computer vision enables robots to see, makes self-driving cars possible (they really are just a kind of robot), and supports new kinds of user interfaces as well as tools for doctors, police, sports coaches, farmers, the military, and more. It can help blind people to navigate their surroundings and to use devices.
Like much of AI technology it may take away many jobs. It may reduce our privacy since it makes it much easier to track people's movements and activities. It can enable autonomous weapons that may kill many.
Another danger is that either intentionally or unintentionally the data the machine learning system is trained on may contain biases. Here is a nice short video on bias and machine learning published by Google.
Image recognition begins with pixels. Each pixel is a number or three numbers (for red, green, and blue components) that corresponds to a tiny piece of an image. There are two main approaches to processing images:
Most of the recent progress has been with machine learning systems that, for many tasks, are near human level. For some tasks, such as detecting lung cancer from x-rays, machine learning has exceeded the ability of experts.
A neural net is a program that is inspired by what little is known about how neurons in animal and human brains work. It can be trained to identify images. In the most common case it is trained on thousands or millions of images that have already been labelled. When given a new image it computes the most likely labels. Some easy visual discrimination tasks can be handled by training on only a hundred or fewer images. You can train a neural net to do this kind of vision recognition in the next chapter.
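To make "trained on labelled examples" concrete, here is a minimal Python sketch of a single artificial neuron (a perceptron). It is far simpler than the deep networks that vision services use, and it works on single made-up pixels rather than whole images, but the pattern is the same: adjust the weights whenever a labelled example is misclassified, then use the weights on new inputs. All names and data here are assumptions chosen for illustration.

```python
# Labelled examples: (red, green, blue) scaled to 0..1, with 1 = "reddish", 0 = "bluish".
training = [
    ((0.9, 0.1, 0.1), 1),
    ((0.8, 0.3, 0.2), 1),
    ((0.7, 0.2, 0.3), 1),
    ((0.1, 0.2, 0.9), 0),
    ((0.2, 0.3, 0.8), 0),
    ((0.3, 0.1, 0.7), 0),
]


def predict(weights, bias, pixel):
    """Report 1 if the weighted sum of the colour components is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, pixel))
    return 1 if total > 0 else 0


def train(examples, epochs=20, rate=0.1):
    """Nudge the weights towards the correct answer after every mistake."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for pixel, label in examples:
            error = label - predict(weights, bias, pixel)  # 0 when already correct
            weights = [w + rate * error * x for w, x in zip(weights, pixel)]
            bias += rate * error
    return weights, bias


weights, bias = train(training)
print(predict(weights, bias, (0.85, 0.2, 0.15)))  # a new reddish pixel → 1
print(predict(weights, bias, (0.15, 0.25, 0.85)))  # a new bluish pixel → 0
```

Neither of the two test pixels appears in the training data, which is the point: the neuron has generalised from six labelled examples.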
Different levels or stages of the processing by the neural net recognise different elements of an image. Researchers have built software that figures out what images produce the greatest response from the different layers.
Some AI cloud providers such as Google can accept a video stream and produce labels. They can also detect scene changes. This service can be expensive, though Google will analyse 1000 minutes of video per month at no cost. There currently are no Snap! blocks for supporting video input.
Adding image recognition to a robotics project can make the robot behave much more intelligently. It can head towards goals and avoid specified objects. A simple example would be a robot told to move to X (where X can be "the red ball", "the toy truck", "a person", or whatever). It begins by sending the current camera image for recognition. If the description returned matches X then the robot goes forward a few steps (assuming the camera is mounted pointing forward). If not, it turns a little and tries again. This repeats until X is reached.
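The search loop just described can be sketched in Python with a simulated robot. Everything here is an assumption made for illustration: SimulatedRobot, fake_recognise, and seek are hypothetical names, the "image" is simply a list of what is in view, and a real project would send camera pixels to a vision service instead. A limit on consecutive turns is added so the robot gives up after a full circle.

```python
class SimulatedRobot:
    """Toy stand-in for real hardware: the target is visible at only one heading."""

    def __init__(self, target_heading, distance):
        self.heading = 0
        self.target_heading = target_heading
        self.distance = distance

    def camera_image(self):
        # A real robot would return pixels; here the "image" is just what is in view.
        return ["red ball"] if self.heading == self.target_heading else ["wall"]

    def turn(self):
        self.heading = (self.heading + 30) % 360

    def forward(self):
        self.distance -= 1

    def arrived(self):
        return self.distance <= 0


def fake_recognise(image):
    """Pretend the vision service labels the image perfectly."""
    return image


def seek(target, recognise, robot, max_turns=12):
    """Turn until the target is seen, then step towards it; False if never found."""
    turns = 0
    while turns < max_turns:
        labels = recognise(robot.camera_image())
        if target in labels:
            robot.forward()
            turns = 0  # the target is in view again, so restart the turn count
            if robot.arrived():
                return True
        else:
            robot.turn()
            turns += 1
    return False


robot = SimulatedRobot(target_heading=90, distance=3)
print(seek("red ball", fake_recognise, robot))  # → True
```

With a real vision service each pass through the loop costs one query, so the size of each turn and step affects how quickly the free quota is used up.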
For projects that use a camera that can't move there are many possibilities:
Google, Microsoft, and IBM Watson document their AI vision services. There are many more vision services; see, for example, this image recognition services comparison page. The Wikipedia page on computer vision is very thorough. There are many MOOCs on computer vision and machine learning for image recognition. Distill is a scientific journal that strives to explain complex machine learning topics clearly. A recent Distill article is an in-depth description of how a neural net does image recognition. A general audience summary of the article was written by the New York Times. Codecademy has a good interview with a data scientist about machine learning.
The next chapter is about machine learning.
Return to the previous chapter on speech recognition.