Respectfully, I think you may be misunderstanding what we're talking about here. What you want is a speech-driven system. What we need is the reverse: a system where text, and, almost more importantly, objects, can be spoken to the user.
The original poster never even discussed things like song position, screen location, track name, or mute, solo, and arm status. We need all of that and more.
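To make "speaking objects" concrete, here's a rough sketch. Everything in it is hypothetical (the Track class, the describe and speak names); a real system would pull these attributes from the DAW's API and hand the phrase to a screen reader or TTS engine, but the shape would be similar:

```python
# Minimal sketch of object-to-speech output, assuming track state is
# exposed as plain attributes. All names here are illustrative only.
from dataclasses import dataclass

@dataclass
class Track:
    name: str
    mute: bool
    solo: bool
    armed: bool

def describe(track: Track, position: str) -> str:
    """Build the phrase a speech engine would announce for a track."""
    flags = [label for label, on in
             (("muted", track.mute),
              ("soloed", track.solo),
              ("armed", track.armed)) if on]
    status = ", ".join(flags) if flags else "no flags set"
    return f"{track.name}, {status}, position {position}"

def speak(text: str) -> None:
    # Stand-in for a real TTS or screen-reader call.
    print(text)

speak(describe(Track("Vocals", mute=False, solo=True, armed=True), "bar 17"))
# → Vocals, soloed, armed, position bar 17
```

The point is that every attribute is announced explicitly rather than left for the user to infer.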
I know that a lot of this can be inferred, but you shouldn't have to guess at the state of the session by working out what wasn't announced.