--- Quote Start ---
I wanted to ask you one more thing about the Nios II soft core: would it be easier to use this CPU for my purpose by programming custom logic? In that case I'd use the C language, right? So in the end it would be easier?
rromano001, yes, I will start by programming as you said. Regarding your questions:
"but now how large is the image":
There is no actual image. I want to use a list of numbers, where every (binary) number indicates a pixel intensity. I'm reducing everything to the clustering of a list of numbers, for example 1000 numbers to be grouped into 3 clusters by calculating their membership values and centroids.
--- Quote End ---
Hi, sorry, but we are speaking different languages. We cannot help if you don't tell us what you wish to learn or build. Please understand that we don't own a crystal ball to see what is in your mind, and your project seems to be so secret.
So is this a linear image, for example from a linear CCD, or a two-dimensional raster image?
What is your cluster? Or better, how does it relate to the numbers in the image memory?
Why are there two indices on uik, the target of the processing, and why do the same indices appear on the internal coefficients?
Are you proficient in mathematical terms? Then express it in a usable form. This equation without any detail is just a waste of resources, bandwidth, and time for all of us.
--- Quote Start ---
This is managed by loading only 1 number into a PE to do the calculation of the equation above. It will produce the first Uki and Vk (centroid). By communicating with the other cloned PEs it will update Vk and calculate Uki again, and so on.
--- Quote End ---
As before, this doesn't resolve where the input memory array and the output memory array reside, so again we have no knowledge of how the numbers get fed to a PE [aka Processing Element], nor whether some of them need buffering to prevent them from being overwritten if the same memory area is used. It makes no difference whether this runs on a hardwired processor, in some FPGA, or in discrete logic: the channel caveats are always the same!
Assuming the image size you finally declared, 1 Kpx, this small amount of memory can simply be allocated in two separate internal M9K-based fast memory blocks, avoiding caching and SDRAM arbitration logic...
If you want help, please start communicating.
The equation and the indices by themselves say nothing, and a system with too many unknowns remains unsolvable by its own mathematical rules.
From your current description, am I to infer that the first pixel generates the others? That is again impossible, since at least 4 different numbers appear in the equation...
--- Quote Start ---
So, the difficult part here is to make the PE perform subtraction, addition, multiplication, and division of fractional numbers.
--- Quote End ---
This is quite simple: not trivial, but not a problem at all.
Remember that this was once done mechanically, so why do you continue to assert that it is not feasible in modern, fast FPGA logic?
See here for some long-ago history, to learn how such machines were assembled:
https://en.wikipedia.org/wiki/z1_%28computer%29
So everything that was feasible on old machinery is no longer feasible now?
Boole and number theory belong to the long, long distant past.
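As an illustration of how ordinary this arithmetic is, fractional numbers are commonly handled in fixed point, both in C on a soft core and in plain FPGA logic. The following is only a minimal sketch under assumptions: the Q16.16 format, the type name q16_16 and the q_* helper names are all hypothetical and not taken from the project described in this thread.

```c
#include <stdint.h>

/* Assumed format: Q16.16, i.e. 16 integer bits and 16 fractional bits. */
typedef int32_t q16_16;
#define Q_ONE ((q16_16)65536)   /* the value 1.0 in Q16.16 */

/* Convert a small integer to fixed point (no overflow check in this sketch). */
static q16_16 q_from_int(int32_t x) { return (q16_16)(x * Q_ONE); }

/* Addition and subtraction are plain integer operations. */
static q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }
static q16_16 q_sub(q16_16 a, q16_16 b) { return a - b; }

/* Multiplication: widen to 64 bits, then rescale by 1.0. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) / Q_ONE);
}

/* Division: prescale the dividend; the caller must ensure b != 0. */
static q16_16 q_div(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * Q_ONE) / b);
}
```

For example, q_div(q_from_int(1), q_from_int(4)) yields Q_ONE / 4, which represents 0.25. In hardware, the same operations map to an adder, a multiplier block, and a divider; only the divider costs noticeable latency.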
--- Quote Start ---
And how do uki, xi, vk and vl interact with each other? As it is in the equation.
--- Quote End ---
So you continue to disregard my question: how are the indices related to the input and output memory? Are they separate memories [different arrays, in terms of C or other programming languages] or the same string of memory cells?
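To make the question concrete, here is what "separate arrays" could look like in C. This is a guess, not the poster's method: the names uki, xi, vk and vl resemble the standard fuzzy c-means update, so the sketch assumes that algorithm with fuzzifier m = 2, one-dimensional data and 3 clusters. The function names and the row-major layout u[k * n + i] are hypothetical.

```c
#include <stddef.h>

#define N_CLUSTERS 3u   /* 3 clusters, as stated in the thread */

/* Input array x[i], centroid array v[k], membership array u[k * n + i]:
 * three separate memories, so nothing is overwritten while it is read. */
void fcm_update_memberships(const double *x, size_t n,
                            const double v[N_CLUSTERS], double *u)
{
    for (size_t i = 0; i < n; i++) {
        double d2[N_CLUSTERS];
        size_t hit = N_CLUSTERS;          /* cluster whose centroid equals x[i] */
        for (size_t k = 0; k < N_CLUSTERS; k++) {
            double d = x[i] - v[k];
            d2[k] = d * d;
            if (d2[k] == 0.0 && hit == N_CLUSTERS) hit = k;
        }
        for (size_t k = 0; k < N_CLUSTERS; k++) {
            if (hit != N_CLUSTERS) {      /* avoid division by zero */
                u[k * n + i] = (k == hit) ? 1.0 : 0.0;
                continue;
            }
            double sum = 0.0;             /* sum over l of (d_ki / d_li)^2 */
            for (size_t l = 0; l < N_CLUSTERS; l++) sum += d2[k] / d2[l];
            u[k * n + i] = 1.0 / sum;
        }
    }
}

/* v[k] = sum_i u_ki^2 * x_i / sum_i u_ki^2 (m = 2 assumed). */
void fcm_update_centroids(const double *x, size_t n,
                          const double *u, double v[N_CLUSTERS])
{
    for (size_t k = 0; k < N_CLUSTERS; k++) {
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < n; i++) {
            double w = u[k * n + i] * u[k * n + i];
            num += w * x[i];
            den += w;
        }
        if (den > 0.0) v[k] = num / den;  /* keep old centroid if cluster is empty */
    }
}
```

Under this assumption, every membership value depends on one input number and on all centroids, and every centroid depends on all input numbers. That dependency is exactly why the memory layout matters before any PE is designed.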
--- Quote Start ---
And where do the inputs come from, and where do the outputs go?
I think I will use the memory of the development board to load a table with the 1000 numbers to be distributed to each PE. The output goes to the neighbor PE for the update, and when the clustering is finished, it will load the results into memory, I guess.
--- Quote End ---
Does PE stand for Processing Element or for Px, pixel element?
Does neighbour mean the first element, so that a PE, after computing one "cluster" of three elements, stores the result back to the first element? This needs some form of parallel-addressable FIFO, holding at least the number of processing elements plus two cells, to keep the data moving in parallel.
And now the caveats of your system:
Memory is shared, so there is one access to read and one to write. The first processing step has a latency of at least as many reads as there are processing elements. After processing ends, the results have to be written back, and this burdens the memory channel again...
This requires planning memory reads and writes in bursts and filling the cache.
Speaking in terms of the dual-core ARM, if one is on board, you can prepare two tasks, one working on the first "cluster" and the second working on the third cluster, so:
you need to read the first memory cells, as many as there are processing elements, and store them in a buffer. The buffer has to be as large as the number of processing elements plus two (every PE needs 3 elements, from what you wrote):
set writeindex to 0
set readindex to 0
2 times:
{   .comment again, this cannot be done in parallel due to RAM access
    shift cellbuffer right one cell                 .comment this can run in parallel with the store operation
    read array[readindex] and store to last cell    .comment this can run in parallel with the shift operation
    increment readindex                             .comment this can run in parallel, with great care
}
.comment element index is now at PEn
loop:
    2 times:
    {   .comment again, this cannot be done in parallel due to RAM access
        shift cellbuffer right one cell                 .comment this can run in parallel with the store operation
        read array[readindex] and store to last cell    .comment this can run in parallel with the shift operation
        increment readindex                             .comment this can run in parallel, with great care
    }
    .comment all PEn+2 elements are now loaded in memory
    pass cellbuffer to task1 and task2 in parallel
    store result 1 to memory[writeindex]
    increment writeindex    .comment this can run in parallel with the store operation
    store result 2 to memory[writeindex]
    increment writeindex    .comment this can run in parallel with the store operation
    if last element not reached then
        continue loop
    else
        done
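The buffering scheme above can be sketched in C as follows. This is an illustration under assumptions, not a definitive implementation: two processing elements, a window of 3 elements per PE, and a placeholder pe_task that merely sums its three cells (the real per-PE computation is not given in this thread). All names are hypothetical, and the two tasks run one after the other here instead of in parallel.

```c
#include <stddef.h>

#define NPE     2u                     /* assumed: two tasks, one per core */
#define WINDOW  3u                     /* every PE needs 3 elements */
#define BUF_LEN (NPE + WINDOW - 1u)    /* number of PEs plus two */

/* Placeholder for the real per-PE computation: sum of three cells. */
static int pe_task(const int *cells)
{
    return cells[0] + cells[1] + cells[2];
}

/* Shift the cell buffer by one cell and store the new value in the last cell. */
static void shift_in(int *buf, int value)
{
    for (size_t i = 0; i + 1 < BUF_LEN; i++) buf[i] = buf[i + 1];
    buf[BUF_LEN - 1] = value;
}

/* Streams in[0..n-1] through the cell buffer; returns the number of results. */
size_t process_stream(const int *in, size_t n, int *out)
{
    int buf[BUF_LEN] = {0};
    size_t readindex = 0, writeindex = 0;

    if (n < WINDOW) return 0;

    /* Prefill: load WINDOW - 1 cells before any PE can start. */
    for (size_t k = 0; k < WINDOW - 1; k++) shift_in(buf, in[readindex++]);

    while (readindex < n) {
        /* Read up to NPE fresh cells; RAM access keeps this part serial. */
        size_t fresh = 0;
        while (fresh < NPE && readindex < n) {
            shift_in(buf, in[readindex++]);
            fresh++;
        }
        /* Each PE works on its own 3-cell window of the shared buffer. */
        size_t first = BUF_LEN - (WINDOW - 1 + fresh);
        for (size_t p = 0; p < fresh; p++)
            out[writeindex++] = pe_task(&buf[first + p]);
    }
    return writeindex;
}
```

With in = 0, 1, ..., 9 the function returns 8 results, from 0+1+2 = 3 up to 7+8+9 = 24. The output goes to a separate array, so no input cell is overwritten before every PE that needs it has read it.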
At this point we can plan two ways of reducing processor starvation due to memory channel congestion...
Starvation was a really great concern on CRAY computers, but it can still plague current systems too, even on modern devices with fast communication and RAM access.
Starvation affects new parallel systems as well, and new techniques can be explored. The old mode of batch processing, doing one thing at a time, STILL leaves the processor cluster starving a lot.
Read more memory elements from the array using DMA during computation (this applies when no main memory access is required; you have two memories, so you can separate the FPGA from the ARM and do it in a super-parallel fashion).
Evaluate the point at which reading ahead starts to reduce performance through bandwidth saturation...
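The read-ahead idea is usually implemented as double buffering: one buffer is processed while the next one is being filled. The sketch below is only an illustration under assumptions. fetch_chunk is a hypothetical stand-in that copies with memcpy, where real hardware would start a DMA transfer and wait for it later; the chunk size of 64 and the summing "computation" are placeholders as well.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 64u   /* assumed read-ahead size; tune against bandwidth saturation */

/* Hypothetical stand-in for a DMA transfer: copies up to CHUNK elements. */
static size_t fetch_chunk(int *dst, const int *src, size_t remaining)
{
    size_t len = remaining < CHUNK ? remaining : CHUNK;
    memcpy(dst, src, len * sizeof *dst);
    return len;
}

/* Placeholder computation (a sum) over the array, using two alternating buffers. */
long process_double_buffered(const int *array, size_t n)
{
    int buf[2][CHUNK];
    long total = 0;
    size_t fetched = 0;
    int cur = 0;

    size_t len = fetch_chunk(buf[cur], array, n);
    fetched += len;

    while (len > 0) {
        /* Start fetching the next chunk; with real DMA this overlaps the loop below. */
        size_t next_len = fetch_chunk(buf[cur ^ 1], array + fetched, n - fetched);
        fetched += next_len;

        for (size_t i = 0; i < len; i++) total += buf[cur][i];

        cur ^= 1;          /* swap buffers */
        len = next_len;
    }
    return total;
}
```

Making CHUNK larger hides more memory latency, but it also occupies the channel longer per transfer, which is the balance mentioned above.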
Again, I assume yours is an exercise, and 1K units is so small that it needs no optimization except in special cases...
The cell buffer can be a special memory with parallel shift logic, and possibly a pipeline of new feed and shift over a number of PE stages...
Everything can be built, but remember:
Only problems we can solve by manual computation can be solved by automata.
We can apply some tricks and clever logic that we have learnt and think of as new, but this cannot help solve the unsolvable either.
Communication is the first skill.
You can appear clever at first, or you can simply communicate that you have no intention of doing it.
Happy new year.