Forum Discussion

RubenPadial
Contributor
10 months ago

Intel FPGA AI Suite Inference Engine

Is there any official documentation on the DLA runtime or inference engine for managing the DLA from the ARM side? I need to develop a custom application for running inference, but so far I’ve only found the dla_benchmark (main.cpp) and streaming_inference_app.cpp example files. There should be some documentation covering the SDK. The only related documentation I have found is the Intel FPGA AI Suite PCIe-based design example: https://www.intel.com/content/www/us/en/docs/programmable/768977/2024-3/fpga-runtime-plugin.html

From what I understand, the general inference workflow involves the following steps (a rough sketch of what I have in mind follows the list):

  1. Identify the hardware architecture
  2. Deploy the model
  3. Prepare the input data
  4. Send inference requests to the DLA
  5. Retrieve the output data
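
In code, I imagine something along these lines. This is only a minimal, untested sketch based on the 2021-style Inference Engine API that dla_benchmark uses; the device string "HETERO:FPGA,CPU", the file names and the preprocessing are placeholders, and I have not found where the architecture selection is documented, so please correct me if the flow is different:

    #include <inference_engine.hpp>
    #include <iostream>
    #include <string>
    using namespace InferenceEngine;

    int main() {
        Core core;

        // 1. Identify the hardware architecture: I assume the DLA arch is fixed when
        //    the bitstream/graph is compiled; at runtime I can at least list devices
        for (const auto& device : core.GetAvailableDevices())
            std::cout << "Found device: " << device << std::endl;

        // 2. Deploy the model: load the compiled graph through the HETERO/FPGA plugin
        CNNNetwork network = core.ReadNetwork("model.xml", "model.bin");
        ExecutableNetwork exeNetwork = core.LoadNetwork(network, "HETERO:FPGA,CPU");
        InferRequest request = exeNetwork.CreateInferRequest();

        // 3. Prepare the input data: fill the blob that belongs to the request
        std::string inputName = network.getInputsInfo().begin()->first;
        float* inputData = request.GetBlob(inputName)->buffer().as<float*>();
        // ... copy the preprocessed sample into inputData ...

        // 4. Send the inference request to the DLA
        request.Infer();

        // 5. Retrieve the output data
        std::string outputName = network.getOutputsInfo().begin()->first;
        const float* outputData = request.GetBlob(outputName)->cbuffer().as<const float*>();
        // ... post-process outputData ...

        return 0;
    }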

42 Replies

  • JohnT_Altera
    Regular Contributor

    Hi Ruben,


    I think you might need to only provide new input data, rather than re-creating the blob; otherwise the runtime treats it as a new inference setting.


    During the first run you perform all the setup; from the second run onwards you should only provide the new input data.
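
    Something like this is what I mean. It is just a rough, untested sketch using the plain OpenVINO Inference Engine API; newDataAvailable() and fillWithNewInput() are placeholders for your own input handling:

        // One-time setup: done on the first run only
        InferenceEngine::Core core;
        auto network = core.ReadNetwork("model.xml");
        auto exeNetwork = core.LoadNetwork(network, "HETERO:FPGA,CPU");
        auto request = exeNetwork.CreateInferRequest();
        std::string inputName = network.getInputsInfo().begin()->first;
        std::string outputName = network.getOutputsInfo().begin()->first;

        // From the second run onwards: only the input data changes; the request
        // and its blobs are reused, so no new inference setting is created
        while (newDataAvailable()) {
            float* in = request.GetBlob(inputName)->buffer().as<float*>();
            fillWithNewInput(in);   // copy the new sample into the existing blob
            request.Infer();
            const float* out = request.GetBlob(outputName)->cbuffer().as<const float*>();
            // ... use out ...
        }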


    • RubenPadial
      Contributor

      Hello @JohnT_Intel,

      Same behaviour.

      I changed the code to create the blobs once before the loop and only fill them inside the loop:


              // Create blobs only once before the loop
              using Blob_t = std::vector<std::map<std::string, Blob::Ptr>>;
              std::vector<std::pair<Blob_t, Blob_t>> ioBlobs = vectorMapWithIndex<std::pair<Blob_t, Blob_t>>(
                  exeNetworks, [&](ExecutableNetwork* const& exeNetwork, uint32_t index) mutable {
                      Blob_t inputBlobs;
                      Blob_t outputBlobs;
                      ConstInputsDataMap inputInfo = exeNetwork->GetInputsInfo();
                      ConstOutputsDataMap outputInfo = exeNetwork->GetOutputsInfo();
                      
                      for (uint32_t batch = 0; batch < num_batches; batch++) {
                          std::map<std::string, Blob::Ptr> outputBlobsMap;
                          for (auto& item : outputInfo) {
                              auto& precision = item.second->getTensorDesc().getPrecision();
                              if (precision != Precision::FP32) {
                                  THROW_IE_EXCEPTION << "Output blob creation only supports FP32 precision. Instead got: " + precision;
                              }
                              auto outputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                              outputBlob->allocate();
                              outputBlobsMap[item.first] = (outputBlob);
                          }
      
                          std::map<std::string, Blob::Ptr> inputBlobsMap;
                          for (auto& item : inputInfo) {
                              Blob::Ptr inputBlob = nullptr;
                              auto& precision = item.second->getTensorDesc().getPrecision();
                              if (precision == Precision::FP32) {
                                  inputBlob = make_shared_blob<PrecisionTrait<Precision::FP32>::value_type>(item.second->getTensorDesc());
                              } else if (precision == Precision::U8) {
                                  inputBlob = make_shared_blob<PrecisionTrait<Precision::U8>::value_type>(item.second->getTensorDesc());
                              } else {
                                  THROW_IE_EXCEPTION << "Input blob creation only supports FP32 and U8 precision. Instead got: " + precision;
                              }
                              inputBlob->allocate();
                              inputBlobsMap[item.first] = (inputBlob);
                          }
      
                          inputBlobs.push_back(inputBlobsMap);
                          outputBlobs.push_back(outputBlobsMap);
                      }
                      
                      return std::make_pair(inputBlobs, outputBlobs);
                  }
              );
      
              std::cout << "Blobs initialized once before the loop.\n";
      
              while (1) {
              ...
                // Fill blobs with new input values (DO NOT re-create them)
                for (size_t i = 0; i < exeNetworks.size(); i++) {
                      slog::info << "Filling input blobs for network ( " << topology_names[i] << " )" << slog::endl;
                      fillBlobs(inputs, ioBlobs[i].first);  // Only fill the existing blobs
                 }
             ...
              }

      Error: dlia_infer_request.cpp:53 Number of inference requests exceed the maximum number of inference requests supported per instance

  • JohnT_Altera
    Regular Contributor

    Hi Ruben,


    I think you might need to try the OpenVINO example designs or another runtime example design (e.g., classification_sample_async or object_detection_demo) to see whether they work on your side.


    • RubenPadial
      Contributor

      Hello @JohnT_Intel,

      Both examples work, but they are intended for CPU/GPU. In addition, they collect multiple input images into a batch and request inference for the entire batch just like the benchmark example. The issue is related to FPGA DLA instantiation. I need to request an inference on every input event. For some reason, this creates a new DLA instance each time instead of reusing the existing one. This leads to an error once the number of inferences reaches five. Do you have any suggestions to address this?
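
      For reference, this is the per-event pattern I am trying to reach. It is only a simplified sketch: waitForInputEvent() is a placeholder for my input trigger, and I am assuming the standard OPTIMAL_NUMBER_OF_INFER_REQUESTS metric is meaningful for the DLA plugin:

          // Check how many infer requests this executable network claims to support
          unsigned int nireq = exeNetwork.GetMetric(
              METRIC_KEY(OPTIMAL_NUMBER_OF_INFER_REQUESTS)).as<unsigned int>();
          std::cout << "Supported infer requests: " << nireq << std::endl;

          // Create ONE request up front and reuse it for every input event
          InferRequest request = exeNetwork.CreateInferRequest();
          std::string inputName = exeNetwork.GetInputsInfo().begin()->first;
          std::string outputName = exeNetwork.GetOutputsInfo().begin()->first;

          while (waitForInputEvent()) {          // blocks until new data arrives
              float* in = request.GetBlob(inputName)->buffer().as<float*>();
              // ... copy the new sample into 'in' ...
              request.StartAsync();
              request.Wait(InferRequest::WaitMode::RESULT_READY);
              const float* out = request.GetBlob(outputName)->cbuffer().as<const float*>();
              // ... handle the result ...
          }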

  • JohnT_Altera
    Regular Contributor

    Hi,

    Do you face any error when running with HETERO, or are you observing that the code intended for CPU/GPU is not working?

  • Hello @JohnT_Intel,

    I mean the original example you suggested is intended for CPU/GPU.

    The real problem is how the inferences are managed. The examples collect multiple input images into a batch and request inference for the entire batch. I need to request an inference every time new data is available. That's when the DLA instantiation problem arises.