Is there any official documentation on the DLA runtime or inference engine for managing the DLA from the ARM side? I need to develop a custom application for running inference, but so far I've only found the dla_benchmark (main.cpp) and streaming_inference_app.cpp example files. There should be some documentation covering the SDK. The only related documentation I found is the Intel FPGA AI Suite PCIe-based design example: https://www.intel.com/content/www/us/en/docs/programmable/768977/2024-3/fpga-runtime-plugin.html

From what I understand, the general inference workflow involves the following steps:
- Identify the hardware architecture
- Deploy the model
- Prepare the input data
- Send inference requests to the DLA
- Retrieve the output data

Hi Ruben,

Currently we do not have any published documentation. Let me check internally whether we have any documentation to share.

Hello @JohnT_Intel,

I know both example applications are based on the OpenVINO runtime, but I cannot find anything about the FPGA and HETERO plugins for running inference in HETERO:FPGA,CPU mode. This is the documentation I found: https://docs.openvino.ai/archives/index.html

Any official documentation from Intel would be very helpful to make the Intel FPGA AI Suite really useful.

Hi Ruben,

Currently the only documentation is from the OpenVINO tools. When you use HETERO:FPGA,CPU, OpenVINO tries to run each layer on the FPGA whenever possible; any layer that cannot run there is executed on the CPU. OpenVINO communicates with the FPGA MMD driver automatically. Let me know if you have further queries or need any help with this.
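The fallback described above can be pictured as a priority walk: for each layer, take the first device in the list that supports it. The following is only a minimal sketch of that idea in plain C++, not the OpenVINO HETERO plugin itself. The names `Capabilities` and `assign_layers`, and the capability table passed in, are hypothetical and exist only for illustration.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of which layer types each device can execute.
using Capabilities = std::map<std::string, std::set<std::string>>;

// Assign each layer to the first device in `priority` that supports its type.
// A layer that no listed device supports gets an empty string.
std::vector<std::string> assign_layers(const std::vector<std::string>& layer_types,
                                       const std::vector<std::string>& priority,
                                       const Capabilities& caps) {
    std::vector<std::string> placement;
    placement.reserve(layer_types.size());
    for (const std::string& type : layer_types) {
        std::string chosen;  // Stays empty if nothing supports this layer.
        for (const std::string& device : priority) {
            auto it = caps.find(device);
            if (it != caps.end() && it->second.count(type) > 0) {
                chosen = device;
                break;  // Earlier devices in the list win.
            }
        }
        placement.push_back(chosen);
    }
    return placement;
}
```

With a priority list of FPGA then CPU, a layer supported by both lands on the FPGA, and a layer supported only by the CPU falls back to the CPU.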

Hello @JohnT_Intel,

But when I use the "GetAvailableDevices()" method, I only get CPU as an available device. There must be something I missed.

From my point of view, there are some points Intel/Altera needs to clarify about using the OpenVINO tools on FPGA devices with the FPGA AI Suite.

Hi,

You may start from the dla_benchmark app and modify it from there. The new method should be to use "device_name.find("FPGA")".
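One reading of this suggestion is a plain substring check on the requested device string. The sketch below assumes that reading. The helper name `targets_fpga` is hypothetical, and only `std::string::find` is a real library call. The result must be compared against `std::string::npos`, because `find` returns the match position, and a match at position 0 would otherwise look like "false".

```cpp
#include <string>

// True when the device string (for example "HETERO:FPGA,CPU") mentions the FPGA.
bool targets_fpga(const std::string& device_name) {
    return device_name.find("FPGA") != std::string::npos;
}
```

This only inspects the string the application asks for. It does not by itself show whether the FPGA plugin was registered with the runtime.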

Intel FPGA AI Suite Inference Engine

42 Replies

RubenPadial
Contributor
7 months ago
Hello @JohnT_Intel ,

I mean the original example you suggested is intended for CPU/GPU.
The real problem is how inferences are managed. The examples collect multiple input images into a batch and request inference for the entire batch. I need to request an inference every time new data is available. That's when the DLA instantiation problem arises.
JohnT_Altera
Regular Contributor
7 months ago
Hi,
Do you get an error when running with HETERO, or are you observing that the code intended for CPU/GPU does not work?
JohnT_Altera
Regular Contributor
7 months ago
Hi,
May I know how you run it? Have you run it with the FPGA plugin?
- RubenPadial
  Contributor
  7 months ago
  Hello @JohnT_Intel ,
  I used the HETERO FPGA plugin.
JohnT_Altera
Regular Contributor
8 months ago
Hi Ruben,

I think you might need to try an OpenVINO example design or another runtime example design (e.g. classification_sample_async or object_detection_demo) to see if it works on your side.
- RubenPadial
  Contributor
  7 months ago
  Hello @JohnT_Intel ,
  Both examples work, but they are intended for CPU/GPU. In addition, they collect multiple input images into a batch and request inference for the entire batch just like the benchmark example. The issue is related to FPGA DLA instantiation. I need to request an inference on every input event. For some reason, this creates a new DLA instance each time instead of reusing the existing one. This leads to an error once the number of inferences reaches five. Do you have any suggestions to address this?

JohnT_Altera

Regular Contributor

9 months ago

Hi Ruben,

I think you only need to provide the new input data without changing the blob; changing the blob makes the runtime treat it as a new inference setup.

During the first run, you should perform all the setup; from the second run onwards, you should only provide the input data.

RubenPadial

Contributor

8 months ago

Hello @JohnT_Intel,

Same behaviour.

I changed the code to create the blobs once before the loop and only fill them inside the loop:

        // Create blobs only once before the loop
        using Blob_t = std::vector<std::map<std::string, Blob::Ptr>>;
        std::vector<std::pair<Blob_t, Blob_t>> ioBlobs = vectorMapWithIndex<std::pair<Blob_t, Blob_t>>(
            exeNetworks, [&](ExecutableNetwork* const& exeNetwork, uint32_t index) mutable {
                Blob_t inputBlobs;
                Blob_t outputBlobs;
                ConstInputsDataMap inputInfo = exeNetwork->GetInputsInfo();
                ConstOutputsDataMap outputInfo = exeNetwork->GetOutputsInfo();
                
                for (uint32_t batch = 0; batch < num_batches; batch++) {
                    std::map<std::string, Blob::Ptr> outputBlobsMap;
                    for (auto& item : outputInfo) {
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision != Precision::FP32) {
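                            // Stream the precision name; `"..." + precision` would do pointer arithmetic on the literal.
                            THROW_IE_EXCEPTION << "Output blob creation only supports FP32 precision. Instead got: " << precision.name();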
                        }
                        auto outputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        outputBlob->allocate();
                        outputBlobsMap[item.first] = (outputBlob);
                    }

                    std::map<std::string, Blob::Ptr> inputBlobsMap;
                    for (auto& item : inputInfo) {
                        Blob::Ptr inputBlob = nullptr;
                        auto& precision = item.second->getTensorDesc().getPrecision();
                        if (precision == Precision::FP32) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                        } else if (precision == Precision::U8) {
                            inputBlob = make_shared_blob<PrecisionTrait<Precision::U8>::value_type>(item.second->getTensorDesc());
                        } else {
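                            // Stream the precision name; `"..." + precision` would do pointer arithmetic on the literal.
                            THROW_IE_EXCEPTION << "Input blob creation only supports FP32 and U8 precision. Instead got: " << precision.name();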
                        }
                        inputBlob->allocate();
                        inputBlobsMap[item.first] = (inputBlob);
                    }

                    inputBlobs.push_back(inputBlobsMap);
                    outputBlobs.push_back(outputBlobsMap);
                }
                
                return std::make_pair(inputBlobs, outputBlobs);
            }
        );

        std::cout << "Blobs initialized once before the loop.\n";

        while (1) {
            ...
            // Fill blobs with new input values (DO NOT re-create them)
            for (size_t i = 0; i < exeNetworks.size(); i++) {
                slog::info << "Filling input blobs for network ( " << topology_names[i] << " )" << slog::endl;
                fillBlobs(inputs, ioBlobs[i].first);  // Only fill the existing blobs
            }
            ...
        }

Error: dlia_infer_request.cpp:53 Number of inference requests exceed the maximum number of inference requests supported per instance
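The error text suggests a fixed number of inference-request slots per instance. The following is a toy model of that situation in plain C++, not the FPGA AI Suite plugin: `ToyInstance`, `run_creating_each_time`, and `run_reusing` are hypothetical names, the limit is a constructor parameter, and the model assumes a slot is never returned once a request has been created. Under those assumptions, creating a request per event fails as soon as the slots run out, while creating one request before the loop and refilling it does not.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Toy stand-in for one accelerator instance; NOT the real plugin.
class ToyInstance {
public:
    struct Request {
        std::vector<float> input;
        std::vector<float> output;
        void infer() { output = input; }  // Pretend inference: echo the input.
    };

    explicit ToyInstance(std::size_t max_requests) : max_requests_(max_requests) {}

    // Assumption: every call consumes one slot permanently.
    std::shared_ptr<Request> create_request() {
        if (created_ >= max_requests_) {
            throw std::runtime_error("Number of inference requests exceed the maximum");
        }
        ++created_;
        return std::make_shared<Request>();
    }

    std::size_t created() const { return created_; }

private:
    std::size_t max_requests_;
    std::size_t created_ = 0;
};

// Anti-pattern: a new request per event. Returns how many events succeeded.
std::size_t run_creating_each_time(ToyInstance& dla, std::size_t events) {
    std::size_t done = 0;
    try {
        for (std::size_t i = 0; i < events; ++i) {
            auto req = dla.create_request();
            req->input.assign(1, static_cast<float>(i));
            req->infer();
            ++done;
        }
    } catch (const std::runtime_error&) {
        // Slot limit reached; stop processing.
    }
    return done;
}

// Reuse: one request created before the loop, refilled on every event.
std::size_t run_reusing(ToyInstance& dla, std::size_t events) {
    auto req = dla.create_request();
    std::size_t done = 0;
    for (std::size_t i = 0; i < events; ++i) {
        req->input.assign(1, static_cast<float>(i));
        req->infer();
        ++done;
    }
    return done;
}
```

If the real runtime behaves like this model, the thing to look for in the application is any code path inside the event loop that ends up creating a request (or the object that owns the requests) instead of reusing one created at start-up.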

RubenPadial

Contributor

9 months ago

Hello @JohnT_Intel ,

As I said, it is also included in dla_benchmark as well as in the application I shared with you. It doesn't work. The relevant code is extracted below:

                        for (size_t iireq = 0; iireq < nireq; iireq++) {
                            auto inferRequest = inferRequestsQueues.at(net_id)->getIdleRequest();
                            if (!inferRequest) {
                                THROW_IE_EXCEPTION << "No idle Infer Requests!";
                            }
                            
                            if(niter != 0LL){
                                std::cout << "#Debug: 10. Set output blob.\n";
                                for (auto & item : outputInfos.at(net_id)) {
                                    std::string currOutputName = item.first;
                                    auto currOutputBlob = ioBlobs.at(net_id).second[iterations.at(net_id)][currOutputName];
                                    inferRequest->SetBlob(currOutputName, currOutputBlob);
                                }
                                std::cout << "#Debug: 10. Set input blob.\n";

                                for (auto & item: inputInfos.at(net_id)){
                                    std::string currInputName = item.first;
                                    auto currInputBlob = ioBlobs.at(net_id).first[iterations.at(net_id)][currInputName];
                                    inferRequest->SetBlob(currInputName, currInputBlob);
                                }
                            }

                            // Execute one request/batch
                            if (FLAGS_api == "sync") {
                                inferRequest->infer();
                            } else {
                                // As the inference request is currently idle, the wait() adds no additional overhead (and should return immediately).
                                // The primary reason for calling the method is exception checking/re-throwing.
                                // Callback, that governs the actual execution can handle errors as well,
                                // but as it uses just error codes it has no details like ‘what()’ method of `std::exception`
                                // So, rechecking for any exceptions here.
                                inferRequest->wait();
                                inferRequest->startAsync();
                            }

                            iterations.at(net_id) ++;
                            if (net_id == exeNetworks.size() - 1) {
                                execTime = std::chrono::duration_cast<ns>(Time::now() - startTime).count();
                                if (niter > 0) {
                                    progressBar.addProgress(1);
                                } else {
                                    // calculate how many progress intervals are covered by current iteration.
                                    // depends on the current iteration time and time of each progress interval.
                                    // Previously covered progress intervals must be skipped.
                                    auto progressIntervalTime = duration_nanoseconds / progressBarTotalCount;
                                    size_t newProgress = execTime / progressIntervalTime - progressCnt;
                                    progressBar.addProgress(newProgress);
                                    progressCnt += newProgress;
                                }
                            }
                        }

JohnT_Altera
Regular Contributor
9 months ago
Hi Ruben,

I think in C++ it uses the code below, which calls wait():
for (ov::InferRequest& ireq : ireqs) {
    ireq.wait();
}
JohnT_Altera
Regular Contributor
9 months ago
Hi Ruben,

You may also refer to the 2023.3 version of the OpenVINO documentation. The sample design can be run with the FPGA: Throughput Benchmark Sample (OpenVINO™ documentation, version 2023.3)

It has both Python and C++ sample code.
- RubenPadial
  Contributor
  9 months ago
  Hello @JohnT_Intel ,
  
  The same. It has a C++ example, but no "wait_all" or similar function is used in it. Only the Python example uses it.
  
  It uses:
  
  for (ov::InferRequest& ireq : ireqs) {
      ireq.wait();
  }
  
  Similar to the code I shared with you.
JohnT_Altera
Regular Contributor
9 months ago
Hi,

I think it should be different. You may refer to openvino.AsyncInferQueue (OpenVINO™ documentation, version 2024).

You may also refer to the Throughput Benchmark Sample (OpenVINO™ documentation, version 2024) for an example that contains wait_all.
- RubenPadial
  Contributor
  9 months ago
  Hello @JohnT_Intel,
  
  dla_benchmark is implemented in C++.
  The API documentation you shared in the previous comment is for Python. The example that uses wait_all is implemented in Python. There is also an example in C++, but it doesn't use wait_all, waitAll, or any similar function.
  In addition, that OpenVINO documentation is available, but the OpenVINO version required by the latest FPGA AI Suite (2024.3) is 2023.3.
JohnT_Altera
Regular Contributor
9 months ago
Hi Ruben,

Sorry for the delay. If you are using the benchmark source code then you will need to include “wait_all” so the inference is completed before you proceed with new input.

You might want to refer to OpenVINO’s classes instead: https://docs.openvino.ai/2024/openvino-workflow/running-inference/integrate-openvino-with-your-application/inference-request.html
- RubenPadial
  Contributor
  9 months ago
  Hello @JohnT_Intel ,
  The following statement is present in the code I shared with you:
  
  std::cout << "#Debug: 10. waitAll.\n";
  // wait the latest inference executions
  for (auto& inferRequestsQueue : inferRequestsQueues)
      inferRequestsQueue->waitAll();
  
  Is this what you are referring to? It doesn't work. Maybe it is not used correctly. Do you have a pseudocode example?
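As a rough illustration of the per-event flow discussed in this thread (take an idle request, fill it, start it, wait for completion, then read the result), here is a toy model in plain C++. It is not the dla_benchmark queue or the OpenVINO API: `ToyRequest`, `ToyQueue`, and `on_event` are hypothetical names, and the "inference" simply echoes the input. The point it tries to show is that the queue is built once and each event waits for completion before the same request is refilled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Toy request: start() marks it busy, wait() finishes the pending work.
struct ToyRequest {
    std::vector<float> input;
    std::vector<float> output;
    bool busy = false;

    void start() {
        if (busy) throw std::logic_error("request started while still busy");
        busy = true;
    }
    void wait() {
        if (!busy) return;  // Nothing in flight.
        output = input;     // Pretend inference: echo the input.
        busy = false;
    }
};

// Toy fixed-size queue, created once for the lifetime of the application.
class ToyQueue {
public:
    explicit ToyQueue(std::size_t n) : requests_(n) {}

    // Returns an idle request, or nullptr when all are in flight.
    ToyRequest* get_idle() {
        for (ToyRequest& r : requests_) {
            if (!r.busy) return &r;
        }
        return nullptr;
    }

    // Blocks (in this toy: simply completes) every in-flight request.
    void wait_all() {
        for (ToyRequest& r : requests_) r.wait();
    }

private:
    std::vector<ToyRequest> requests_;
};

// Handle one input event using an existing request from the queue.
std::vector<float> on_event(ToyQueue& queue, const std::vector<float>& data) {
    ToyRequest* req = queue.get_idle();
    if (req == nullptr) {
        queue.wait_all();  // Drain in-flight work, then retry.
        req = queue.get_idle();
    }
    if (req == nullptr) throw std::runtime_error("queue has no requests");
    req->input = data;     // Refill the existing buffer; nothing is re-created.
    req->start();
    queue.wait_all();      // Complete before accepting the next input.
    return req->output;
}
```

Even a queue of size one can serve any number of events this way, because the single request is idle again by the time the next event arrives.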
