This agent serves as a baseline for fine-tuning on human expert gameplay. It was trained for 200 episodes using a deep Q-network (DQN) with experience replay.
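For reference, below is a minimal sketch of a DQN training step with experience replay in the spirit of the setup described above. The framework (PyTorch), network architecture, class and function names, and hyperparameters are all assumptions for illustration, not the exact implementation used here; a target network and epsilon-greedy action selection are omitted for brevity.

```python
# Hypothetical DQN-with-experience-replay sketch (names and hyperparameters assumed).
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a game-state vector to Q-values, one per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.tensor(states, dtype=torch.float32),
            torch.tensor(actions, dtype=torch.int64),
            torch.tensor(rewards, dtype=torch.float32),
            torch.tensor(next_states, dtype=torch.float32),
            torch.tensor(dones, dtype=torch.float32),
        )

    def __len__(self):
        return len(self.buffer)

def train_step(q_net, optimizer, buffer, batch_size=64, gamma=0.99):
    """One gradient step on the temporal-difference error over a sampled minibatch."""
    if len(buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    # Q(s, a) for the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped target: r + gamma * max_a' Q(s', a'), zeroed at episode end.
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch, sampling minibatches uniformly from the replay buffer breaks the correlation between consecutive game frames, which is the main motivation for experience replay in DQN training.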
The agent achieves a score of 46. It reliably avoids walls and eats food, but it also exhibits some strange behavior. For example, notice how the agent's movement is sometimes inefficient, wrapping around itself before resuming its search for food.
Early training episodes produce low scores while the agent explores candidate policies. Performance jumps sharply around episode 100, and improvements become progressively smaller as training approaches episode 200.
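The low early scores are what one would expect from an exploration schedule that starts mostly random and anneals over training. The linear epsilon-greedy schedule below is an assumed illustration, not the exact schedule used here.

```python
def epsilon_by_episode(episode: int,
                       eps_start: float = 1.0,
                       eps_end: float = 0.05,
                       decay_episodes: int = 100) -> float:
    """Linearly anneal exploration from eps_start to eps_end (assumed schedule)."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Early episodes act almost entirely at random (epsilon near 1.0), which keeps
# initial scores low; as epsilon decays, the agent increasingly exploits its
# learned Q-values and scores climb.
```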