We propose using the Global Dynamic Critical Path to diagnose system-wide bottlenecks using representative benchmarks to direct embedded SoC optimizations and provide real-world experience of implementing the global critical path (GCP) analysis framework on a Globally-Asynchronous Locally-Synchronous (GALS) SoC built around the LEON3 CPU. We perform our analysis at the RTL level and extend our evaluation to abstract RTL models. We use the power-delay product as the example cost function for optimization; we can adjust the power-delay by tuning the frequency of the clock domains of each SoC IP block. We show that the GCP optimization framework can accommodate other cost functions as well, while effectively directing SoC optimization efforts. Our case studies demonstrate that the GCP algorithm can converge quickly to solutions even in the very large (exponential) search spaces describing permissible SoC configurations, with no designer intervention (for instance, we find the solution of a 6-dimensional space with 19000 configurations in 11 steps). Even though our initial implementation relies on manual source code instrumentation, we only add 1% extra lines of code to the design. This represents annotating less than 0.2% of the ports of the overall Multi-processor SoC design.