Meta is a social networking company based in Menlo Park, CA. My work focused on the Infra Cloud Service Platform (ICSP), which was launched to directly address infrastructure growth, both in scalability for AI/ML capacity and in overall complexity. It’s one of the core Infrastructure products, aimed at reducing the complexity of developing and operating services for teams at Meta and at improving service development efficiency. The long-term target is for 100% of services to be supported by and migrated into ICSP.
ICSP also aims to standardize and homogenize all internal infrastructure by creating a unified data model of a service to build around. Additionally, ICSP would provide clear, authoritative solutions for common service needs, for both simple and complex services. The platform would expose the full power of service management. Functionality would be discoverable, documented and well-maintained. Success is measured in fewer SEV events (or shorter time to resolve them), since those events stress monetization efforts.
My focus was owning the integration of Pacing for Change Safety, along with the entirety of unified Change Safety.
Pacing is the ability to track a code change in “stages”, from inception to landing. Each code change is deployed cautiously in increments, phase by phase, with a test for success in each phase to ensure proficiency, health and actual "safety" in quality.
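To make the idea concrete, here is a minimal sketch of pacing in Python. It is an illustration only and not how ICSP implements pacing: the stage names, the traffic percentages, the `health_check` callback, and the change ID `D123` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical stages: names and traffic percentages are illustrative only.
@dataclass(frozen=True)
class Stage:
    name: str
    traffic_percent: int


STAGES: List[Stage] = [
    Stage("canary", 1),
    Stage("early", 10),
    Stage("half", 50),
    Stage("landed", 100),
]


def pace_change(change_id: str, health_check: Callable[[str, Stage], bool]) -> List[str]:
    """Advance a change stage by stage and stop at the first failed check.

    Returns the names of the stages the change passed.
    """
    passed: List[str] = []
    for stage in STAGES:
        if not health_check(change_id, stage):
            break  # halt the rollout; later stages never receive the change
        passed.append(stage.name)
    return passed


# Hypothetical check that fails once the change reaches 50% of traffic.
def fails_at_half(change_id: str, stage: Stage) -> bool:
    return stage.traffic_percent < 50


print(pace_change("D123", fails_at_half))
```

The point of the sketch is the shape of the loop: a change only moves to the next stage after the current stage's check succeeds, so a problem caught early never reaches the wider audience.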
Change Safety is the ability to safely make changes to a Service, which is a critical part of Service Management. It is achieved through a combination of pre-deployment testing and gradual rollout policies. Change Safety is fragmented outside of ICSP, and it’s up to Service Owners to set up the different tools for each type of change, whether it’s related to testing, delivery methods, policies or health checks.
In short, it’s a unique opportunity to pull many levers, increase reliability of Services “out of the box”, reduce safety-related toil for Service Owners, and reduce the cost of maintaining Meta’s infra toolbox. However, unless there’s an active drive towards unification, there’s a risk of ending up as just another fragmented solution space, especially as more Domains start to think about their safety workflow and infrastructure.
The challenge of creating a unified vision for pacing under the Change Safety umbrella came from the fragmented path that many service owners experienced: they had to "travel" through multiple tools to troubleshoot SEVs and maintain their services. The goal of ICSP was to create a "one stop shop" that would retire many of these external tools, done gradually over multiple years to reduce potentially negative impact.
I had to start treating the umbrella concept of Change Safety as a "north star", with pacing integration in ICSP as the first deliverable milestone. So I set out to make a plan that included:
1. outlining a proposal to gain alignment from stakeholders
2. running an introductory UX research session with 1-on-1 whiteboarding workshops
3. presenting the feedback back to the group to maintain alignment
4. the design phase (iterating between designs and 1-on-1s)
5. post-design delivery support for engineering efforts.
The initial UX research surfaced two major categories of improvements for ICSP:
system-wide updates: heavy impact / heavy effort
feature-based updates: heavy impact / minimal effort
After presenting these findings to the group, we were able to add the feature-based updates to the immediate roadmap while planning long-term for the system-wide updates.
Once I got sign-off on these recommendations, we were able to move into the flow phase of the design work, which included some visualizations of how the user moves through the current pacing feature. This was used to expose any pain points in the process and make sure the expectations matched what the product could actually do.
We were, in fact, building the technology while also designing it (back end and front end), so I had to juggle sitting in on and understanding highly technical concepts while also relaying the progress of the project in simpler terms to my fellow design colleagues.
I really felt like this is where I shined in the project: thanks to my years as an engineer, I'm able to "speak nerd" quite well. I even started a regular publishing series within the design community called "ELi5", which took highly technical concepts and broke them down into simpler terms.
Anyway, on to the flows! This exercise helped me understand the true path of a paced change through the deployment lifecycle because, even though I still mildly think in black-and-white, I'm also a very visual person.
As the low-fi wireframes progressed, the team agreed that a few infrastructure changes had to happen: two fragmented sections would be combined into one to create a major timeline for tracking all changes across each phase of deployment.
The section detailing the changes also had to be updated with adequate tracking data for each phase, to give clarity and transparency into the health of the deployment.
Keeping all of that in mind, we also had to maintain a level of generic flexibility across ICSP to accommodate different services that might onboard with specific requirements.
After we had sign-off on the wireframes, it was time to move into the high-fi designs in Figma so that engineering efforts could get underway.
After a few review cycles internally, I drafted an RFC (Request for Comment) detailing all of the updates discussed within the team, along with the designs, to share out to the wider engineering group. This allowed for collaboration with external XFN groups that had ties to our efforts, or ties to external tools that would help feed into the improvements made in the new interface.
The new efforts were well received, and minor changes were requested to the order in which these updates would be made, based on product need and engineering availability.
While engineering proceeded with these new efforts, I decided to run a follow-up UX research phase to prove (or disprove) the validity of my designs against the initial UX research findings.
I split my UXR into two participant groups: internal follow-up participants and new service owners. That way I could test my design theory against what was already reported and also put the new designs in front of fresh potential users that we targeted for onboarding.
I followed the same path as last time: drafting the research proposal, gathering feedback through UXR interviews, and drafting my findings into a presentation to the group.
My UXR findings concluded that fragmentation had been reduced considerably and that only small, optional changes remained to upgrade the interface!
These were high-impact changes with minimal engineering effort that could be folded into projects already scheduled on the roadmap under similar initiatives.
The lovely thing about designing a UI with an established library is that we could make these changes using out-of-the-box components without any major need for customization.
High impact. Minimal effort.
Metrics for success:
ICSP is still considered to be in its infancy and only very recently went from alpha status to version 1 with live production traffic, so there's not much actual data to go on.
That said, success can be measured in how the evolution of this UI has helped onboard some major services into the ICSP ecosphere and enabled our infra leaders to invest more people-power in infrastructure.
Looking forward, it's the first major milestone toward an environment where changes move through their deployment lifecycle safely and without errors.