I'd check the raw CF write speed to see if you can actually write data fast enough.
Try writing several sectors per command (one data transfer for each sector), and check whether the CF supports multi-sector writes (one data transfer for, say, 4 sectors).
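As a rough sketch of how to check for multi-sector support: as far as I know, the low byte of word 47 of the IDENTIFY DEVICE data gives the maximum number of sectors per data block for WRITE MULTIPLE, with 0 meaning it isn't supported. The code below assumes you have already read the 256-word IDENTIFY block into memory. The function names are made up for illustration and aren't from any particular driver.

```c
#include <stdint.h>

/* Max sectors per data block for READ/WRITE MULTIPLE, taken from the low
 * byte of IDENTIFY DEVICE word 47. 0 means multi-sector blocks are not
 * supported. Hypothetical helper: 'id' is the 256-word IDENTIFY block
 * that the caller has already read from the card. */
unsigned cf_max_multiple(const uint16_t id[256])
{
    return id[47] & 0xFFu;
}

/* Number of separate data transfers needed to write 'sectors' sectors
 * when the card accepts 'block' sectors per transfer. block == 1 is the
 * plain one-transfer-per-sector case. */
unsigned cf_data_transfers(unsigned sectors, unsigned block)
{
    if (block == 0)
        block = 1;              /* fall back to one sector per transfer */
    return (sectors + block - 1) / block;   /* round up */
}
```

For a 16-sector write, cf_data_transfers(16, 1) gives 16 transfers and cf_data_transfers(16, 4) gives 4, which is the difference worth timing on the real card.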
IIRC the CF ATA command set also includes an 'erase sectors' command. Pre-erasing may have an effect and may make writes faster, but I didn't try it when I wrote CF support (many years ago).
The CF ATA command sequence using PIO is fairly simple, so it shouldn't be that difficult to test. You will get faster writes using DMA (either DMA mode transfers, or PIO mode with DMA doing the data copy), but you can probably work out how much that might save by working out how long the copy itself takes.
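The arithmetic for that estimate is simple enough to sketch. This assumes you have timed a test write (total microseconds) and separately timed just the data copy for the same number of bytes. The function names are hypothetical, and the "best case" figure assumes DMA would hide the copy time completely, which real hardware won't quite manage.

```c
#include <stdint.h>

/* Write rate in bytes per second for 'bytes' written in 'usec'
 * microseconds. Returns 0 if no time was measured. */
uint64_t write_rate_bps(uint64_t bytes, uint64_t usec)
{
    if (usec == 0)
        return 0;
    return bytes * 1000000u / usec;
}

/* Upper bound on the rate if the data copy cost nothing at all
 * (i.e. DMA fully overlapped with everything else). Returns 0 if the
 * measurements are inconsistent (copy time >= total time). */
uint64_t best_case_dma_rate_bps(uint64_t bytes, uint64_t total_usec,
                                uint64_t copy_usec)
{
    if (copy_usec >= total_usec)
        return 0;
    return write_rate_bps(bytes, total_usec - copy_usec);
}
```

If the measured rate is already close to the best-case figure, DMA isn't going to help much and the card itself is the limit.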
FAT16 is slightly simpler than FAT32, and if you use large clusters then you can get quite big files. Writing FAT16 code that is specific to your application (rather than using the generic FAT code) should let you get filesystem writes near to the physical write speed limit.
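To put a number on "quite big": a FAT16 volume can have at most 65524 data clusters, so the usable space scales directly with the cluster size. A minimal sketch of the calculation (the function name is made up):

```c
#include <stdint.h>

/* Largest cluster count for which a volume is still treated as FAT16. */
#define FAT16_MAX_CLUSTERS 65524u

/* Maximum data area of a FAT16 volume, in bytes, for a given cluster
 * geometry. A single file can fill at most this much space. */
uint64_t fat16_max_data_bytes(uint32_t sectors_per_cluster,
                              uint32_t bytes_per_sector)
{
    return (uint64_t)FAT16_MAX_CLUSTERS * sectors_per_cluster
           * bytes_per_sector;
}
```

With 512-byte sectors and 64 sectors per cluster (32 KiB clusters) this comes to 2147090432 bytes, just under 2 GiB. With 4 sectors per cluster (2 KiB clusters) it is only about 128 MiB.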