The technical principles of image recognition are not as simple as they seem

For humans, describing what we see, the "visual world," seems so trivial that we rarely notice it is something we are doing all the time. When we see something, whether it is a car, a big tree, or a person, we usually name it without a second thought. For a computer, however, telling a "human object" apart from "non-human objects" such as puppies, chairs, or alarm clocks is remarkably difficult.


Solving this problem would bring enormous benefits. "Image recognition" technology, and more broadly "computer vision" technology, is the foundation of many emerging applications: from driverless cars and facial recognition software to seemingly mundane but important developments such as "smart factories" that monitor production lines for defects and violations, and the automation software insurance companies use to process and classify claim photos. None of these emerging technologies would be possible without image recognition.

In the following sections, we will explore the problems and challenges of image recognition and analyze how scientists use a special kind of neural network to tackle them.

Learning to "see" is a difficult, high-cost task

One way to approach this problem is to apply metadata to unstructured data. In a previous article, we described the problems and challenges of classifying and searching text content when metadata is scarce or absent. Having people manually classify and tag movies and music is already a daunting task. But some tasks are not merely arduous; they are practically impossible. Consider training the navigation system of a driverless car to distinguish other vehicles from pedestrians crossing the road, or tagging, classifying, and screening the thousands of photos and videos that users upload to social networking sites every day.

The only viable way to solve this problem is with neural networks. In theory, we could use conventional neural networks for image analysis, but in practice the computational cost of this approach is very high. For example, a conventional neural network processing even a very small image, say 30x30 pixels, already requires 900 data inputs and more than 500,000 parameters. That is still feasible on a reasonably powerful machine; but for a larger image, say 500x500 pixels, the number of inputs and parameters the machine must handle grows to an unmanageable scale.
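To make the scale of the problem concrete, here is a small back-of-the-envelope sketch in Python. It assumes a single fully connected hidden layer whose width (600 neurons) is chosen only for illustration, since the article does not specify a layer size, so the numbers are indicative rather than exact.

```python
# Rough parameter count for a fully connected (dense) layer on raw pixels.
# Assumption: a hidden layer of ~600 neurons, chosen only to illustrate
# the order of magnitude mentioned in the text.

def dense_params(width, height, channels=1, hidden_neurons=600):
    inputs = width * height * channels      # one input per pixel value
    weights = inputs * hidden_neurons       # every input connects to every neuron
    biases = hidden_neurons
    return inputs, weights + biases

for size in (30, 500):
    inputs, params = dense_params(size, size)
    print(f"{size}x{size} image: {inputs} inputs, {params:,} parameters")

# 30x30   -> 900 inputs and ~540,600 parameters (over half a million)
# 500x500 -> 250,000 inputs and ~150,000,600 parameters
```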

In addition, using a conventional neural network for "image recognition" raises another problem: overfitting. Simply put, overfitting is the phenomenon in which a model fits its training data too closely. Not only is this generally associated with an excessive number of parameters (and therefore even more computation), it also undermines the network's ability to recognize images correctly when it is confronted with new data.

The real solution - convolution

Fortunately, we have found that a small change to the structure of the neural network makes processing large images far more tractable. The modified network is called a "convolutional neural network," also known as a CNN or ConvNet.

One of the advantages of neural networks is their general-purpose adaptability. However, as we have just seen, in image processing this advantage actually becomes a disadvantage. The "convolutional neural network" makes a deliberate trade-off here: to obtain a feasible solution, we sacrifice some of the network's generality and design a network specialized for image processing.

In any image, proximity and similarity are strongly correlated: two adjacent pixels are far more likely to be related than two pixels far apart in the image. The "convolutional neural network" exploits exactly this principle. In a conventional neural network, by contrast, every pixel is connected to every neuron in the next layer. This naturally increases the computational burden, and that added burden actually ends up reducing the accuracy of the network.

Convolutional networks solve this problem by pruning away many of these unnecessary connections. In technical terms, the "convolutional network" filters connections by relevance, which makes image processing computationally tractable. The "convolutional network" deliberately restricts connectivity so that each neuron accepts input only from a small patch of the previous layer (say 3x3 or 5x5 pixels), avoiding an excessive computational burden. Each neuron therefore only needs to process a small portion of the image (much like the human visual cortex, where each neuron responds only to a small part of the overall visual field).
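A minimal sketch of that restriction, using NumPy: instead of connecting a neuron to every pixel, it is connected only to a 3x3 patch. The patch location and the random weights are arbitrary and purely illustrative.

```python
import numpy as np

image = np.random.rand(30, 30)            # a toy 30x30 grayscale image

# A fully connected neuron needs one weight per pixel: 900 weights.
fully_connected_weights = np.random.rand(30 * 30)
fc_output = fully_connected_weights @ image.ravel()

# A convolutional neuron only looks at a small local patch: 3x3 = 9 weights.
local_weights = np.random.rand(3, 3)
patch = image[10:13, 10:13]               # an arbitrary 3x3 region of the image
conv_output = np.sum(local_weights * patch)

print(fc_output, conv_output)
```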

The inner workings of the "convolutional neural network"

How does the "convolutional neural network" filter out unnecessary connections? The secret lies in two newly added layers: the convolutional layer and the pooling layer. We will walk through a practical case, letting the network decide whether there is a "grandmother" in a photo, and break down the operation of the "convolutional neural network" step by step.

The first step is the "convolution layer." The "convolution layer" itself actually involves several steps:

1. First, we break grandma's photo down into overlapping 3x3-pixel tiles.

2. Then, we run each tile through a simple, single-layer neural network, keeping the weights the same across all tiles. This turns our tiles into a stack of feature maps. Because we broke the original image into small pieces (3x3-pixel tiles in this case), the neural network needed to process them stays manageable.

3. Next, we arrange these output values into arrays, using numbers to represent the content of each region of the photo. The axes represent height, width, and color, so we get a three-dimensional numerical representation of each tile. (If we were dealing not with grandma's photo but with video, we would get a four-dimensional representation, with time as the extra axis.) A sketch of this convolution step appears just after this list.
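Here is a minimal sketch of that convolution step in NumPy: the same small weight matrix is slid over every overlapping 3x3 tile of a grayscale image to produce one feature map. Real convolutional layers use many such filters and color channels; the single random filter here is only an illustrative assumption.

```python
import numpy as np

image = np.random.rand(30, 30)        # toy grayscale photo
kernel = np.random.rand(3, 3)         # one shared set of weights for every tile

h, w = image.shape
feature_map = np.zeros((h - 2, w - 2))

# Slide the same 3x3 weights over every overlapping tile of the image.
for i in range(h - 2):
    for j in range(w - 2):
        tile = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(tile * kernel)

print(feature_map.shape)              # (28, 28): one output value per tile position
```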

After the "convolution layer," the next step is the "pooling layer."

The "aggregation layer" combines the spatial dimensions of this three-dimensional (or four-dimensional) group with the sampling function, and outputs an associative array containing only the relatively important parts of the image. This joint array not only minimizes the computational burden, but also effectively avoids overfitting.

Finally, we feed the downsampled array from the "pooling layer" into a regular, fully connected neural network. Through convolution and pooling, we have greatly reduced the number of inputs, so the array at this point is perfectly manageable for an ordinary network, while still retaining the most important parts of the original data. The output of this final step expresses how confident the system is in the judgment that "there is a grandmother in the photo."
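Putting the pieces together, here is a minimal end-to-end sketch using the Keras API bundled with TensorFlow. The input size, filter counts, and layer widths are illustrative assumptions rather than values from the article; the final sigmoid output can be read as the network's confidence that grandma is in the photo.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Convolution layer: 3x3 filters slide over the image (assumed 64x64 RGB input).
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # Pooling layer: downsample the feature maps, keeping the strongest responses.
    layers.MaxPooling2D((2, 2)),
    # Flatten into a vector and feed a regular, fully connected network.
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    # Single output: confidence that there is a grandmother in the photo.
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```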

The above is only a brief description of how a "convolutional neural network" works; in reality the process is considerably more complicated. Moreover, unlike our example here, real "convolutional neural network" systems generally deal with hundreds or even thousands of labels.

Implementation of "Convolutional Neural Network"

Building a "convolutional neural network" from scratch is a very time-consuming and expensive task. Recently, however, many APIs have appeared that allow organizations to perform image analysis without in-house computer vision or machine learning experts.

"Google Cloud Vision" is Google's visual identity API, which is based on the open source TensorFlow framework and uses a REST API. Google Cloud Vision includes a fairly comprehensive set of tags that detect individual objects and faces. In addition, it has some additional features, including OCR and "Google Image Search."

The "IBM Watson Visual Identity" technology is an important part of the "Watson Cloud Developer". Although it covers a large number of built-in clusters, it is actually trained to customize the cluster based on the images you provide. Like "Google Cloud Vision", "IBM Watson Visual Identity" also has many excellent features, such as OCR and NSFW detection.

Clarifai is an up-and-coming image recognition service, also exposed through a REST API. Notably, Clarifai includes a number of models that can be tailored to specific scenarios, such as weddings, travel, and even food.

The APIs above are well suited to common applications, but some specialized tasks may still require a tailor-made solution. Fortunately, many libraries handle the computational and optimization work, which relieves some of the pressure on data scientists and developers and lets them focus on model training. Most of these libraries, including TensorFlow, Deeplearning4j, and Theano, have already been widely and successfully applied.
