Case Analysis of Deep Learning Technology Based on Spark and BigDL

This article shares the practical experience of Intel and JD.com in building a large-scale image feature extraction framework based on Spark and the BigDL deep learning framework.

Background

Image feature extraction is widely used in applications such as similar-image retrieval and de-duplication. Before adopting the BigDL framework (described later), we tried to develop and deploy feature extraction applications on single machines with multiple GPU cards and on GPU clusters. These approaches have obvious disadvantages:


In a GPU cluster, per-card resource allocation is complex and error-prone; for example, insufficient remaining GPU memory can cause out-of-memory (OOM) errors and application crashes. On a single machine, compared with cluster mode, developers must handle data sharding, load balancing, and fault tolerance manually. GPU applications, taking Caffe as an example, have many dependencies, including CUDA, which increases the difficulty of deployment and maintenance; when problems arise with different operating system or GCC versions, the application must be recompiled and repackaged.

These problems confront GPU-based inference programs with many architectural challenges in practice.

Let's look at the scenario itself. Because the backgrounds of many product pictures are complex and the subject usually occupies only a small proportion of the image, the subject needs to be separated from the picture to reduce background interference with feature extraction accuracy. The image feature extraction framework is therefore naturally divided into two steps: first, the target is located with an object detection algorithm, and then its features are extracted with a feature extraction algorithm. Here we use SSD [1] (Single Shot MultiBox Detector) for object detection and the DeepBit [2] network for feature extraction.
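The hand-off between the two steps can be sketched in plain Python/NumPy. This is a minimal, framework-free illustration (the detection tuple format and function name are assumptions for this sketch, not SSD's actual output schema): given a list of candidate boxes with scores, keep the best one and crop the subject region before feature extraction.

```python
import numpy as np

def crop_best_detection(image, detections):
    """Keep the highest-scoring detection and crop the subject region.

    image:      H x W x C uint8 array
    detections: list of (score, (x1, y1, x2, y2)) tuples, one per candidate box
    Returns the cropped subject image, or the full image if nothing was detected.
    """
    if not detections:
        return image
    score, (x1, y1, x2, y2) = max(detections, key=lambda d: d[0])
    # Clamp coordinates to the image bounds before slicing.
    h, w = image.shape[:2]
    x1, x2 = max(0, x1), min(w, x2)
    y1, y2 = max(0, y1), min(h, y2)
    return image[y1:y2, x1:x2]

# Toy example: a 100x100 image with two candidate boxes.
img = np.zeros((100, 100, 3), dtype=np.uint8)
dets = [(0.35, (0, 0, 50, 50)), (0.92, (20, 10, 80, 90))]
subject = crop_best_detection(img, dets)
print(subject.shape)  # (80, 60, 3)
```

In the real pipeline this selection runs inside a Spark map over the detection RDD, but the per-image logic is the same.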

JD.com stores a huge number (hundreds of millions) of product pictures in a mainstream distributed open-source database. How to efficiently retrieve and process this data in a large-scale distributed environment is therefore a key problem for the image feature extraction pipeline. Existing GPU-based solutions face additional challenges in this scenario: downloading the data takes a long time, and a GPU-based scheme cannot accelerate this step; and for the picture data in the distributed open-source database, the early data-processing stages of the GPU scheme are very complex, with no mature software framework for resource management, distributed data processing, and fault-tolerance management.


Because of the limitations of the GPU software and hardware stack, scaling the GPU scheme to handle large-scale images is very challenging.

BigDL Integration Scheme

In a production environment, reusing existing software and hardware facilities greatly improves efficiency (for example, by reducing the development time of new products) and lowers cost. In this case, the data is stored in a mainstream distributed open-source database in a big data cluster. If the deep learning application can run on the existing big data cluster (such as a Hadoop or Spark cluster) for computing, the above challenges are easily resolved.

Intel's open-source BigDL project [3] is a distributed deep learning framework on Spark that provides comprehensive deep learning algorithm support. Thanks to the distributed scalability of the Spark platform, BigDL can easily scale out to hundreds or thousands of nodes. At the same time, BigDL uses the Intel MKL math library and parallel computing techniques to achieve high performance on Intel Xeon servers (computing power comparable to mainstream GPUs). In our scenario, BigDL was customized to support various models (detection and classification); models were ported from environment-specific formats to the BigDL big data environment, which supports general model formats (Caffe, Torch, TensorFlow); and the whole pipeline was optimized and accelerated. The feature extraction pipeline built with BigDL in the Spark environment is shown in Figure 1:

1. Use Spark to read hundreds of millions of original pictures from the distributed open-source database and build an RDD.
2. Use Spark to preprocess the pictures, including resizing, subtracting the mean value, and assembling the data into batches.
3. Use BigDL to load the SSD model and perform large-scale distributed object detection on the pictures through Spark, obtaining a series of detection coordinates and corresponding scores.

4. Keep the detection result with the highest score as the subject target, and crop the original picture according to the detection coordinates to obtain the target picture.
5. Preprocess the RDD of target pictures, including resizing, and assemble them into batches.
6. Use BigDL to load the DeepBit model and perform distributed feature extraction on the detected target pictures through Spark, obtaining the corresponding features.

7. Store the detection results (the RDD of extracted target features) on HDFS.

The whole data analysis pipeline, including data reading, data partitioning, preprocessing, prediction, and result storage, can be easily implemented on Spark through BigDL. Users can run deep learning applications with BigDL on an existing big data cluster (Hadoop/Spark) without modifying any cluster configuration. Moreover, BigDL can easily scale out to a large number of nodes and tasks thanks to the high scalability of the Spark platform, which greatly speeds up the data analysis process. In addition to distributed deep learning support, BigDL also provides many easy-to-use tools, such as an image preprocessing library and model-loading tools (including loaders for third-party deep learning frameworks), which make it more convenient for users to build the whole pipeline.
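The steps above can be sketched schematically in plain Python, using ordinary lists in place of Spark RDDs. All function bodies here are stubs standing in for the real components (the BigDL SSD and DeepBit calls, the actual resize/mean-subtraction logic); the point is the shape of the dataflow, not the models.

```python
# Schematic sketch of the seven pipeline steps. detect() and extract_features()
# are stand-ins for the BigDL SSD and DeepBit model calls; in the real pipeline
# each map runs distributed over an RDD instead of a Python list.

def preprocess(img):            # steps 2 and 5: resize, subtract mean (stubbed)
    return img

def detect(img):                # step 3: SSD stand-in, returns (score, box) pairs
    return [(0.9, (0, 0, 2, 2))]

def crop(img, box):             # step 4: cut out the subject region
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in img[y1:y2]]

def extract_features(img):      # step 6: DeepBit stand-in, returns a feature vector
    return [float(sum(map(sum, img)))]

images = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]           # step 1: "read" raw pictures
batches = [preprocess(img) for img in images]           # step 2
detections = [(img, detect(img)) for img in batches]    # step 3
subjects = [crop(img, max(d, key=lambda t: t[0])[1])    # step 4: best box per image
            for img, d in detections]
subjects = [preprocess(s) for s in subjects]            # step 5
features = [extract_features(s) for s in subjects]      # step 6
print(features)                                         # step 7: would be saved to HDFS
```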

Image Preprocessing

BigDL provides an image preprocessing library [4] based on OpenCV [5], which supports common image transformation and image augmentation functions. Users can easily chain these basic functions to build an image preprocessing pipeline, and can also call the OpenCV primitives exposed by the library to implement custom image transformations. The preprocessing pipeline of this sample converts the original RDD into an RDD of batches through a series of transformations. BytesToMat converts a byte-encoded picture into OpenCV's Mat storage format; Resize scales the picture to 300x300; MatToFloats stores the Mat pixels as a float array and subtracts the mean value of each channel; finally, RoiImageToBatch assembles the data into batches, which serve as the model input for prediction or training.
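The numeric effect of this transformation chain can be sketched without BigDL or OpenCV. The sketch below uses a naive nearest-neighbor resize in place of the library's OpenCV-based Resize, and illustrative per-channel means (not BigDL's actual defaults); only the shapes and the mean-subtraction arithmetic are the point.

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbor resize, standing in for the library's OpenCV-based Resize."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def to_floats(img, mean_rgb):
    """Mat -> float array with per-channel mean subtraction (as in MatToFloats)."""
    return img.astype(np.float32) - np.asarray(mean_rgb, dtype=np.float32)

def to_batch(images):
    """Stack preprocessed images into one N x H x W x C batch (as in RoiImageToBatch)."""
    return np.stack(images)

# Example: two fake 600x450 "pictures" through the chain.
raw = [np.full((600, 450, 3), 128, dtype=np.uint8) for _ in range(2)]
mean_rgb = (123.0, 117.0, 104.0)   # illustrative means, not BigDL's defaults
batch = to_batch([to_floats(resize_nn(img, 300, 300), mean_rgb) for img in raw])
print(batch.shape)   # (2, 300, 300, 3)
```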

Loading Models

Users can easily use BigDL to load a pre-trained model and use it directly in a Spark program. Given a BigDL model file, calling Module.load returns the model. In addition, BigDL supports importing models from third-party deep learning frameworks such as Caffe, Torch, and TensorFlow.

Users can easily load a trained model for data prediction, feature extraction, model fine-tuning, and so on. Taking Caffe as an example, a Caffe model consists of two files: a model prototxt definition file and a model parameter file. Users can thus load a pre-trained Caffe model directly into a Spark and BigDL program.

Performance

We benchmarked the Caffe-based GPU cluster solution against the BigDL-based Xeon cluster solution. The tests were run in JD's internal cluster environment.

Test Standard

The end-to-end picture processing and analysis pipeline includes:

1. Reading pictures from the distributed open-source database (downloading pictures from the picture source into memory).

2. Feeding the pictures into the object detection model and feature extraction model for feature extraction.
3. Saving the results (picture paths and features) to the file system.

Note: downloading is an important factor affecting overall end-to-end throughput. In this case, it accounts for about half of the total processing time (download + detection + feature extraction), and a GPU server cannot use the GPU to accelerate the download step.
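Because downloads cannot be GPU-accelerated, they bound the achievable end-to-end speedup regardless of how fast the models run. A quick Amdahl's-law check, assuming (as the note says) download is roughly half of the total time:

```python
def end_to_end_speedup(serial_fraction, compute_speedup):
    """Amdahl's law: speedup of the whole pipeline when only the
    non-serial (compute) portion is accelerated."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / compute_speedup)

# With download taking ~50% of total time, a 10x faster model only gives
# ~1.82x end to end, and even an infinitely fast model caps out at 2x.
print(round(end_to_end_speedup(0.5, 10.0), 2))   # 1.82
print(round(end_to_end_speedup(0.5, 1e9), 2))    # 2.0
```

This is why the benchmark below measures the whole pipeline end to end rather than model inference in isolation.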

Testing Environment

GPU: NVIDIA Tesla K40, 20 cards running concurrently.
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 1200 logical cores in total (each server has 24 physical cores with Hyper-Threading enabled, configured as 50 logical cores in YARN).

Test Results

Figure 2 shows that the throughput of Caffe processing pictures concurrently on 20 K40 cards is about 540 pictures/s, while the throughput of BigDL on the YARN (Xeon) cluster with 1200 logical cores is about 2070 pictures/s. The throughput of BigDL on the Xeon cluster is thus about 3.83 times that of the GPU cluster, which greatly shortens the processing time for large-scale image sets. The results show that BigDL provides better support for large-scale image feature extraction applications: its high scalability, high performance, and ease of use help JD cope more easily with the massive and explosive growth of its picture data. Based on these results, JD is upgrading its GPU-cluster Caffe implementation of image feature extraction to the BigDL scheme on the Xeon cluster and deploying it to the Spark cluster production environment.
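The reported ratio follows directly from the two throughput numbers:

```python
gpu_throughput = 540    # pictures/s, Caffe on 20 x K40
cpu_throughput = 2070   # pictures/s, BigDL on 1200 logical Xeon cores
ratio = cpu_throughput / gpu_throughput
print(round(ratio, 2))  # 3.83
```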

Figure 2: Throughput comparison of K40 and Xeon in the picture feature extraction pipeline.

BigDL's high scalability, high performance, and ease of use help JD use deep learning technology to process massive images more easily. JD will continue to apply BigDL to a wider range of deep learning applications, such as distributed model training.

References

[1] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision. Springer, Cham, 2016.
[2] Lin, Kevin, et al. "Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[3] https://github.com/intel-analytics/BigDL

[4] https://github.com/intel-analytics/analytics-zoo/tree/master/transform/vision
[5] OpenCV.
