First, your coding style is inconsistent in many places. There are _n/_p suffixed pairs for next/present state signals, but others use nxt_/prsnt_ as a prefix or a suffix. Pick one convention and stick to it. Another style issue is that you sometimes initialize the counter values right before starting a loop, i.e. on the transition into the loop, while in other cases you rely on the value having been cleared somewhere else after the last loop run. This can get you into trouble some day and doesn’t make the code easier to read. You also have several commands to send over the interface, but only three of them are named as such (CMD, CMD_READFAT, CMD_ROOT), while the others (reset, init, writesector) follow a different naming scheme and are initialized differently, e.g. they don’t use the constant CRC to reflect the trailing x"95" pattern.
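To illustrate what I mean by consistent command naming, here is a minimal sketch of a package. The package name, the cmd_frame_t subtype and CMD_RESET are my own hypothetical names, not taken from your code; I am only assuming the usual 48-bit frame (command byte, 32-bit argument, CRC byte) and the x"95" CRC constant you already have.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package: all names except CRC are assumptions for illustration.
package sd_cmd_pkg is
  -- Trailing CRC byte shared by the command frames, as in the existing constant.
  constant CRC : std_logic_vector(7 downto 0) := x"95";

  -- One frame type for every command: command byte & 32-bit argument & CRC.
  subtype cmd_frame_t is std_logic_vector(47 downto 0);

  -- The reset command built from the shared constant instead of a literal x"95".
  constant CMD_RESET : cmd_frame_t := x"40" & x"00000000" & CRC;

  -- The remaining commands (init, writesector, ...) would follow the same
  -- CMD_<NAME> pattern and the same construction.
end package sd_cmd_pkg;
```

The same idea applies to the state signals: one suffix pair (for example state_p/state_n and count_p/count_n) used everywhere.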
Your way of walking the states also looks too low-level to me. It’s always the same procedure with only slight variations: output data at SCK==0, wait, switch to SCK==1, wait, possibly fetch data at SCK==0, wait, switch to SCK==1, wait, and so on. This could either be placed into a separate, smaller state machine that is triggered and parametrized by your bigger one, or you could at least put these repetitive operations into a procedure. This would clean up the code significantly. Besides improving readability, it will reduce the chance of doing something inconsistently between the different but similar states.
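As a rough sketch of the "separate, smaller state machine" option: the entity below shifts one byte out and in per start pulse, so the main state machine only has to provide tx_data, pulse start and wait for busy to drop. All names, the HALF_PERIOD generic and the exact edge timing (drive while SCK is low, sample on the rising SCK edge) are my assumptions and would have to be matched to your existing timing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical byte-transfer helper; names and timing are assumptions.
entity spi_byte is
  generic (
    HALF_PERIOD : positive := 4  -- system clocks per SCK half period (assumed)
  );
  port (
    clk     : in  std_logic;
    start   : in  std_logic;                     -- one-cycle pulse from the main FSM
    tx_data : in  std_logic_vector(7 downto 0);
    rx_data : out std_logic_vector(7 downto 0);
    busy    : out std_logic;
    sck     : out std_logic;
    sdi     : out std_logic;                     -- to the card
    sdo     : in  std_logic                      -- from the card, assumed synchronized
  );
end entity spi_byte;

architecture rtl of spi_byte is
  signal tx_reg  : std_logic_vector(7 downto 0) := (others => '1');
  signal rx_reg  : std_logic_vector(7 downto 0) := (others => '0');
  signal bit_cnt : integer range 0 to 7 := 7;
  signal div_cnt : integer range 0 to HALF_PERIOD - 1 := 0;
  signal sck_r   : std_logic := '0';
  signal busy_r  : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if busy_r = '0' then
        sck_r <= '0';
        if start = '1' then
          -- Counters are always initialized on the way into the transfer.
          tx_reg  <= tx_data;
          bit_cnt <= 7;
          div_cnt <= 0;
          busy_r  <= '1';
        end if;
      elsif div_cnt < HALF_PERIOD - 1 then
        div_cnt <= div_cnt + 1;                  -- wait out the half period
      else
        div_cnt <= 0;
        if sck_r = '0' then
          sck_r  <= '1';                         -- rising SCK edge: sample input
          rx_reg <= rx_reg(6 downto 0) & sdo;
        else
          sck_r  <= '0';                         -- falling SCK edge: next output bit
          tx_reg <= tx_reg(6 downto 0) & '1';
          if bit_cnt = 0 then
            busy_r <= '0';                       -- eight bits done
          else
            bit_cnt <= bit_cnt - 1;
          end if;
        end if;
      end if;
    end if;
  end process;

  sck     <= sck_r;
  sdi     <= tx_reg(7);                          -- MSB first, idles high
  busy    <= busy_r;
  rx_data <= rx_reg;
end architecture rtl;
```

With something like this, each of your command and data states shrinks to "load byte, pulse start, wait for busy = '0'", and the SCK handling exists in exactly one place.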
Something more severe is that your main entity outputs (CS, SCK, SDI) are combinatorial outputs of the state. This is not good at all: you will get glitches on the wires after most clock edges, which is especially problematic for your SCK wire. You should at least route those signals through registers, although this will affect your output timing. The easiest way of doing this without losing too much time on the outgoing path would be a falling_edge() triggered FF for each of these signals, placed between the internal combinatorial state machine assignments and the entity pins. If this reduces the achievable clock speed too much, it would have to be a rising_edge() FF instead, but keep an eye on your expected timing.
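A minimal sketch of such output registers, assuming the combinatorial values are available as cs_comb, sck_comb and sdi_comb (hypothetical names, as is the entity itself):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical wrapper: registers the state machine outputs on the falling clock edge.
entity out_regs is
  port (
    clk      : in  std_logic;
    cs_comb  : in  std_logic;  -- combinatorial values decoded from the state
    sck_comb : in  std_logic;
    sdi_comb : in  std_logic;
    cs       : out std_logic;  -- glitch-free signals towards the pins
    sck      : out std_logic;
    sdi      : out std_logic
  );
end entity out_regs;

architecture rtl of out_regs is
  -- Initial values are an assumption: card deselected, clock low, data line high.
  signal cs_r  : std_logic := '1';
  signal sck_r : std_logic := '0';
  signal sdi_r : std_logic := '1';
begin
  process (clk)
  begin
    -- The decode logic has half a clock period to settle before it is captured.
    if falling_edge(clk) then
      cs_r  <= cs_comb;
      sck_r <= sck_comb;
      sdi_r <= sdi_comb;
    end if;
  end process;

  cs  <= cs_r;
  sck <= sck_r;
  sdi <= sdi_r;
end architecture rtl;
```

Switching to rising_edge(clk) gives the decode logic a full period instead, at the cost of delaying the pins by another half cycle.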
Your RAM accesses might not be synthesizable into block RAM, which would waste quite some registers and/or logic and reduce your clock speed, too (but I haven’t run synthesis on your code). I would rewrite them to better reflect the operation of a block RAM and allow the tool to infer one. I would only enable wr_enbl for one cycle, when data is really ready to be written. I would keep the RAM size at a power of two, even if there are two CRC bytes added; just store those in separate registers. Keep your addresses in the proper range, say 0 to 511, and add proper wraparound or limiting functionality to the code. Counters may run, for example, from 0 to 513 if this is beneficial, but that just gives you headaches with addresses and RAM sizes. I would also recommend using an up-counting range for the memory arrays, as this typically reflects the usual picture of a memory dump better. Please group your memory initialization into proper sets of, say, 16 values per line and use your ENTER key.
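For reference, this is the kind of template that synthesis tools usually map to block RAM: one clocked process, a write enable, and a registered read. Take it as a sketch; the entity name and ports other than wr_enbl are hypothetical, and the exact template your tool accepts should be checked against its synthesis guide.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 512 x 8 sector buffer; the two CRC bytes are kept outside the RAM.
entity sector_ram is
  port (
    clk     : in  std_logic;
    wr_enbl : in  std_logic;                     -- high for exactly one cycle per byte
    addr    : in  unsigned(8 downto 0);          -- 0 to 511, cannot leave the range
    din     : in  std_logic_vector(7 downto 0);
    dout    : out std_logic_vector(7 downto 0)
  );
end entity sector_ram;

architecture rtl of sector_ram is
  -- Up-counting range so the array reads like a memory dump.
  type ram_t is array (0 to 511) of std_logic_vector(7 downto 0);
  signal ram : ram_t := (others => (others => '0'));
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if wr_enbl = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      -- Synchronous read: the data appears one cycle after the address.
      dout <= ram(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```

Because the address is a 9-bit unsigned, wraparound comes for free; a byte counter that runs to 513 would then only select between the RAM and the two CRC registers.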
Another thing, which might not cause you any trouble for a long time, is the absence of any metastability protection on the input signal SDO. While in most cases it is just fed into a shift register, in s25 it is used directly to form the count_n/_p and prsnt_/nxt_state signals. If SDO changes right at the time the clock edge appears, some of the bits of count_p and prsnt_state will have been updated while the others have not, leading to an inconsistent state that might not be covered by your VHDL code at all. You might seriously deadlock, depending on the synthesis result. The easiest way of handling metastability issues is to place a two-stage, preserved and non-duplicated register pipeline on your signal input. This, again, will change the timing on your interface.
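The two-stage register pipeline could look roughly like this. Entity and port names are hypothetical, and the attribute that keeps the tool from optimizing or duplicating the registers is vendor-specific, so I only hint at it in a comment.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical two-flip-flop synchronizer for the asynchronous SDO input.
entity sync_2ff is
  port (
    clk     : in  std_logic;
    d_async : in  std_logic;   -- SDO straight from the pin
    q_sync  : out std_logic    -- use only this version inside the design
  );
end entity sync_2ff;

architecture rtl of sync_2ff is
  signal stage : std_logic_vector(1 downto 0) := (others => '1');
  -- A vendor-specific attribute (for example ASYNC_REG on Xilinx tools) would be
  -- attached to "stage" here to preserve the registers and prevent duplication.
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- The first stage may go metastable; the second stage gives it a full
      -- clock period to resolve before any logic sees the value.
      stage <= stage(0) & d_async;
    end if;
  end process;

  q_sync <= stage(1);
end architecture rtl;
```

Every consumer, including the shift register and the s25 decision, then sees the same settled value, delayed by two clock cycles.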
I want to comment on three previous statements. First, you don’t need a ‘when others=>’ clause for your encoded or enumerated states; at least it doesn’t solve all your problems with invalid states. Adding such a clause will just result in warnings for enumerated states that have already been completely decoded. Second, leaving out a reset input and its handling, and relying on proper signal initialization instead, is not bad per se. You just have to watch for warnings telling you which registers the synthesis tool could not implement with the requested initial level. And, of course, once your block hangs or needs to restart operation after some event, there is no way to restart it without some rather global means (power off/on, GSR or similar). Third, warnings about missing signals on the sensitivity list are typically only relevant when simulating the code. If you synthesize code without wait statements, the sensitivity list is not important, but simulation might give a different result if the simulator follows the spec regarding the sensitivity list.
Does the reported maximum clock rate fit your needs? Did you constrain that in any way?
Do you use the same clock speed as on the Xilinx part? Does your SD card support these frequencies?
– Matthias