Now that I have FFT working in real time on my FPGA, here are some musings about what else I need to consider.
The FFT core doesn't have any built-in windowing functions. It should be possible to add this by looking at the FFT's input index and scaling each sample by the window coefficient for that index. Some information on a Hamming window FPGA implementation is available at http://www.polar-design.de/html/dsp-artikel_englisch.html.
The idea is to use a ROM table of coefficients and a multiplier. If the multiplier is fully pipelined, this should be very fast and require only a single multiplier per channel (and one lookup ROM shared by all four channels!). I should totally do something like this.
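As a rough software model of that datapath (one ROM lookup and one multiply per sample), something like the following could be used to sanity-check bit widths before writing any HDL. This is only a sketch: the 12-bit test tone is a made-up input (the ADC width isn't stated here), and the window is computed from the standard symmetric Hamming formula so that no toolbox is needed.

```matlab
% Fixed-point model of windowing: one ROM lookup and one multiply per sample.
N = 4096;                                  % FFT length
B = 10;                                    % coefficient width in bits
n = (0:N-1).';                             % FFT input index

% Symmetric Hamming window, then scaled to integer ROM contents (0..511)
w   = 0.54 - 0.46*cos(2*pi*n/(N-1));
rom = round(w/max(w) * (2^(B-1) - 1));

% Hypothetical input: full-scale 12-bit signed test tone (width is an assumption)
x = round(2047 * sin(2*pi*100*n/N));

% Datapath: multiply by rom(index), then drop the B-1 fractional bits
p = x .* rom;                              % product fits in 12 + 10 = 22 bits
y = floor(p / 2^(B-1));                    % same as an arithmetic right shift

fprintf('max |x| = %d, max |y| = %d\n', max(abs(x)), max(abs(y)));
```

Because the coefficients top out at 511 rather than 512, the windowed output can never exceed the input range, so the result still fits in the original sample width.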
Rather than using the streaming FFT implementation, I could use dual port RAM and offload samples at 50 MHz. Once a block is stored, these could be processed with FFT at a much higher speed (100-200 MHz). This might make it easier to adjust the sampling rate (by changing the external clock source) independently of other FPGA functions, with only a fairly small increase in complexity.
On the other hand, there isn't a lot to be gained by moving away from the streaming architecture, except that I could use the multiple-channel FFT core, which would be useful if I become resource constrained. With the streaming architecture I'll use up 100% of the DSP48A1 resources, and I'm not sure whether I'll need more of these elsewhere (actually, I'll need four as multipliers for the windowing function, but this could be done with LUTs instead). I could just as easily put block RAM after the FFT's output. The four-channel implementation uses the same number of DSP48A1s as the single-channel streaming version but has four times the latency (or less if I fiddle with the architecture type, which influences the calculations in ways I can't remember), so it may still need to be run at double speed and instantiated twice. That seems like a lot more effort unless I really need to do it.
I can generate windowing coefficient data in Matlab as follows (based on http://esl.eng.ohio-state.edu/~rstheory/iip/window.pdf):
    clear all
    N = 4096;
    B = 10; % bits
    w = hamming(N);
    w = w/max(w);
    w = round(w.*((2^(B-1))-1));
    filename = 'window.coe';
    fid = fopen(filename,'wt');
    fprintf(fid,'memory_initialization_radix=10;\n');
    fprintf(fid,'memory_initialization_vector=\n');
    for i=1:(N/2)-1
        fprintf(fid,'%d,\n', w(i));
    end
    fprintf(fid,'%d;\n', w(N/2));
    fclose(fid);
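Because N is an even number, the window has no shared center sample and its second half is an exact mirror of the first, so only the first N/2 coefficients need to be stored. To conserve FPGA resources, I'll use an up/down counter to generate the ROM address for the second half, or a trick like cnt[9] ? ~cnt[8:0] : cnt[8:0] (that form is for a 10-bit counter; with N = 4096 and a 2048-entry ROM the equivalent is cnt[11] ? ~cnt[10:0] : cnt[10:0]).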
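The folded-address trick can be checked in software before committing it to hardware. The sketch below assumes a 12-bit sample counter for N = 4096 and rebuilds the full window from the stored half using the same top-bit test; the variable names are mine, not anything from the FFT core.

```matlab
% Check that a half-length ROM plus a folded address reproduces the full window.
N    = 4096;
bits = log2(N);                            % 12-bit sample counter
n    = (0:N-1).';

w    = 0.54 - 0.46*cos(2*pi*n/(N-1));      % symmetric Hamming window
full = round(w/max(w) * 511);              % all N coefficients
half = full(1:N/2);                        % what the ROM actually stores

cnt  = (0:N-1).';                          % counter value for each input sample
msb  = bitshift(cnt, -(bits-1));           % cnt[11]
low  = bitand(cnt, N/2 - 1);               % cnt[10:0]

% cnt[11] ? ~cnt[10:0] : cnt[10:0], with the inversion taken over 11 bits
addr = low;
addr(msb == 1) = (N/2 - 1) - low(msb == 1);

% +1 because MATLAB indexing starts at 1
err = max(abs(half(addr + 1) - full));
fprintf('largest mismatch: %d LSB\n', err);
```

A mismatch of zero is what I'd expect. A stray 1 LSB difference would only mean that floating-point rounding landed differently on the two halves of the reference window, not that the addressing is wrong.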
This is now done. Max speed after PAR is still very good (122 MHz).
Next steps:
These approaches both look like they'd require a lot of RAM, so it's probably a good idea to start using the off-chip memory as a buffer. Store a single FFT output in block RAM and then transfer that to external RAM. Accumulate a bunch of those, and then average them at a higher clock speed.
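For reference, the accumulate-then-average step might look like this in software. It is a sketch under one big assumption: that the average is taken over per-bin power, which is only one of several possible choices (a coherent average of the complex outputs would be another). The random input is just a placeholder for real ADC blocks.

```matlab
% Accumulate K FFT outputs and average them, bin by bin.
N = 4096;                      % FFT length
K = 32;                        % number of FFTs to average (assumed value)

acc = zeros(N, 1);             % accumulator, one entry per bin
for k = 1:K
    x   = randn(N, 1);         % placeholder for one block of ADC samples
    X   = fft(x);
    acc = acc + abs(X).^2;     % assumption: accumulate power, not complex values
end
avgPower = acc / K;            % one averaged spectrum per K input blocks
```

Choosing K as a power of two keeps the final division down to a right shift in hardware, and the accumulator needs log2(K) = 5 extra bits over a single power value.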
In the current hardware implementation, Fs is 50e6 and the FFT is 4096 points. Assuming that radio transmissions are 15 ms in length, this means that 183 FFTs can be performed per transmission. Averaging over 32 FFTs would mean that about five fixes could be made per transmission. It would be fairly easy to tune these parameters at design time.
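The numbers above come from a few lines of arithmetic, written out here so the parameters are easy to change. It assumes back-to-back, non-overlapping FFT blocks.

```matlab
% FFTs and fixes per transmission for a given set of design parameters.
Fs   = 50e6;                   % sample rate, Hz
N    = 4096;                   % FFT length
Ttx  = 15e-3;                  % assumed transmission length, s
Navg = 32;                     % FFTs averaged per fix

Tfft = N / Fs;                 % time to collect one block: 81.92 us
nFFT = floor(Ttx / Tfft);      % whole FFTs per transmission: 183
nFix = floor(nFFT / Navg);     % whole fixes per transmission: 5

fprintf('%.2f us per FFT, %d FFTs, %d fixes\n', Tfft*1e6, nFFT, nFix);
```

The 23 FFTs left over after five full averages (183 - 5*32) are simply unused in this scheme.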
To do:
Implement memory interface core and start feeding raw ADC data into it.
Get around to implementing a very simple GMII interface - shouldn't take too long.