In general, the smaller models (<3B parameters) seemed to struggle, while medium-sized models (>8B), and likely larger ones, seemed to do much better. Code-tailored models like codellama and others likely ...