Computer vision is an interdisciplinary scientific field that involves how to enable computers to obtain high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate tasks that the human visual system can perform.
Deep learning has allowed rapid development in the field of computer vision in recent years. In this article, I want to discuss a specific task called semantic segmentation in computer vision. Although researchers have found many ways to solve this problem, I will discuss a specific architecture, UNET, which uses a fully convolutional network model to accomplish the task.
We will use UNET to create the first step solution to the TGS salt identification challenge proposed by Kaggle.
In addition, the purpose of my blog is to provide some intuitive ideas about common operations and terms in convolutional networks for understanding images. Some of them include convolution, max grouping, receptive field, upsampling, transposed convolution, skip connection, etc.
I assume that the reader is already familiar with the basics of machine learning and convolutional networks. You should also have some working knowledge of ConvNets using Python and Keras libraries.
3. What is semantic segmentation? The computer can understand images at multiple levels of granularity. For each of these levels, there is a problem defined in the field of computer vision. From a deeper understanding to a finer understanding, let us describe these problems below:
(a)Image Classification :
The most basic building block in computer vision is the image classification problem, where, given an image, we expect the computer to generate a discrete The label, which is the main object of the image. In image classification, we assume that there is only one object (instead of multiple) in the image.
b. Classification with Localization :
When locating with discrete tags, we also expect the calculation to be able to accurately locate the position of the object in the image. The position is usually realized by a bounding box, which can be identified by some numerical parameters related to the image boundary. Even in this case, it is assumed that there is only one object per image.
(c) object detection:
object detection extends positioning to the next level. Now images are not limited to having a single object, but can also contain multiple objects. The task is to classify and locate all the objects in the image. Here again, the bounding box concept is used for positioning days.
(d) Semantic Segmentation:
The goal of semantic image segmentation is to tag each pixel in the image with the corresponding category represented. Because we want to predict every pixel in the image, this task is generally called dense prediction.
Note that unlike the previous task, the expected result in semantic segmentation is not just the label and bounding box parameters. The output itself is a high resolution image (usually the same size as the input image), where each pixel is classified into a specific category. Therefore, it is a classification of images at the pixel level.
E. Segmentation of Instances:
Segmentation of Instances Is One Step Ahead of Semantic Segmentation Along with classification at the pixel level, we want the computer to classify each instance of the class separately. For example, in the image above, there are 3 people, strictly speaking, 3 instances of the class “Person”. The 3 types are classified separately (in different colors). But semantic segmentation does not distinguish between instances of a specific class.
If you are still confused about the difference between object detection, semantic segmentation, and instance segmentation, the following figure will help clarify this:
In this article, we will learn how to use a fully convolutional network (FCN) called UNET Solve the problem of semantic segmentation
If you want to know if semantic segmentation is useful, your query is reasonable. However, it turns out that many complex tasks in Vision require this detailed understanding of images. For example:
a. Autonomous Driving Car
Autonomous driving is a complex robotic task that requires perception, planning and execution in an ever-changing environment. This task must also be performed with the highest precision, because safety is the most important. Semantic segmentation provides information about the free space of the road and the detection of lane markings and road signs.
Biomedical Imaging Diagnostics
The machine can enhance the analysis performed by radiologists, greatly reducing the time required to run diagnostic tests.
The Geo Sensing
semantic segmentation problem can also be thought of as a classification problem, where each pixel is classified as one of a series of object categories. Therefore, there is a use case for land use mapping for satellite imagery. Information on land cover is important for various applications, such as monitoring areas of deforestation and urbanization.
To identify the land cover type of each pixel in the satellite image (e.g. urban, agricultural, water area, etc.), land cover classification can be viewed as a semantic class segmentation task multiple. Inspection of roads and buildings is also an important research topic in traffic management, urban planning and road surveillance. Few large-scale data sets are publicly available (for example: SpaceNet), and data labeling has always been the bottleneck of segmentation tasks.
precision farming robots can reduce the amount of herbicides that need to be sprayed in the field. Semantic crop and weed segmentation help them trigger weeding actions in real time. These advanced agricultural imaging technologies can reduce manual monitoring of agriculture.
5. Business problem
In any machine learning task, it is always recommended to spend a lot of time correctly understanding the business problem we are trying to solve. This not only helps to effectively apply technical tools, but also inspires developers to use their skills to solve real-world problems.
TGS is a leading earth science and data company that uses seismic imaging and 3D rendering to understand which areas below the surface of the earth contain large amounts of oil and gas.
Interestingly, the surface containing oil and natural gas also contains a lot of salt. Therefore, with the help of seismic technology, they tried to predict which areas of the earth’s surface contained a lot of salt.
Unfortunately, professional seismic imaging requires professional human vision to accurately identify salt bodies. This leads to highly subjective and variable representations. In addition, if human prediction is wrong, it may cause heavy losses to oil and gas company drilling personnel.
Therefore, TGS organized a Kaggle competition to use computer vision to solve this task with greater efficiency and accuracy.
6. Understanding the data
For simplicity, we will only use the train.zip file that contains the images and their corresponding skins.
In the image catalog, there are 4,000 seismic images, which human experts use to predict whether there may be salt mines in the area.
In the mask directory, there are 4000 grayscale images, which are the actual terrain values of the corresponding images, indicating whether the seismic image contains layers of salt and, if so, where. These will be used to build supervised learning models.