    Entry posted May 8, 2018 by DanO, tagged Best Practices, Member Spotlight, Product / Product Release
    LinkedIn’s Elegant Solution to Monitoring Network Performance

    I’ve been working with Oracle Service Cloud since 2007 (back when it was still RightNow). For the past five years I’ve managed LinkedIn’s Oracle Service Cloud implementation, handling everything from standard administration tasks and managing workspaces, workflows and business rules to doing C# development for Add-Ins.

    Recently, I had an “aha” moment. I discovered a way to monitor network performance for Oracle Service Cloud (and beyond!). This solution changed the game for LinkedIn, and I wanted to share in hopes it might help others.

    Challenge: Doing Root Cause Analysis of Performance Complaints

    LinkedIn has been using Oracle Service Cloud for several years across our contact centers, located primarily in the U.S., Dublin, Singapore, and Bangalore. During this time our global team has reported that Oracle Service Cloud was s-l-o-w or crashing from time to time.

    Our Oracle Service Cloud administration team would investigate these issues, but it was extremely difficult to know where the issue originated. Was it a problem with Oracle Service Cloud, the internet connection for a specific LinkedIn location, or the agent’s computer? Were our scheduled reports and system utilities running at 12am CST causing performance issues for our Bangalore team? To make things harder, our administration team is primarily based in the U.S., so we couldn’t always troubleshoot our global team’s issues in real time.

    Despite our best efforts at resolution, we didn’t have a consistent, efficient way to identify the root cause of Oracle Service Cloud performance complaints. It was an ongoing headache and a black hole of effort for our team.

    Solution: Create Network Performance Monitor Add-In

    During a discussion around our need for visibility into network performance issues, a thought occurred: the Add-In framework was always running in the background within Oracle Service Cloud and I could set up an Add-In to create an ongoing log of network performance metrics across all of our office locations and remote agents.

    [Diagram: LinkedIn's Network Performance Monitor Add-In Solution]

    I spent a day designing, building and testing a solution using:

    • A custom object to store a wide array of network performance data
    • A lightweight Add-In that would ping five globally accessible sites: Oracle Service Cloud chat server, Oracle Service Cloud production server, LinkedIn.com, Google.com and Facebook.com. The Add-In starts logging when the user logs in and runs in the background every 15 minutes until they log out. I tried to find a balance between the storage space consumed by these logs and having enough data to be useful. For example, an hour between each monitoring cycle is too long to determine how the network was performing at the time of a complaint.

    This solution generates several data points on network performance across five sites for the entire day (because users start at different times) for every Oracle Service Cloud agent worldwide at LinkedIn.
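
    To make one monitoring cycle concrete, here is a minimal sketch of the idea. The real Add-In is written in C# against the Oracle Service Cloud Add-In framework; this Python version only illustrates the logic. The hostnames, the PingRecord fields, and the use of a timed TCP handshake in place of an ICMP ping are all assumptions made for illustration, not our actual schema or code.

```python
import socket
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional

# Hypothetical target list; the two Oracle Service Cloud hosts are
# site-specific and are left out here.
TARGETS = ["www.linkedin.com", "www.google.com", "www.facebook.com"]


@dataclass
class PingRecord:
    """One row of an assumed custom-object schema."""
    agent: str
    location: str
    target: str
    measured_at: datetime
    latency_ms: Optional[float]  # None means the target was unreachable


def tcp_latency_ms(host: str, port: int = 443, timeout: float = 3.0) -> Optional[float]:
    """Time a TCP handshake as a stand-in for an ICMP ping."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return None


def run_cycle(agent: str, location: str, targets: List[str],
              probe: Callable[[str], Optional[float]] = tcp_latency_ms) -> List[PingRecord]:
    """Probe every target once and return one record per target."""
    now = datetime.now(timezone.utc)
    return [PingRecord(agent, location, t, now, probe(t)) for t in targets]
```

    Calling run_cycle("agent1", "Dublin", TARGETS) would produce three records for that cycle; in a real Add-In each record would then be saved to the custom object. The probe parameter exists so the cycle can be tested without touching the network.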

    Benefits: Faster Root Cause Analysis, Improved Performance and More!

    Now when we receive a complaint about Oracle Service Cloud performance, we can easily check the logs from our Network Performance Monitor Add-In and quickly tell if a network issue is/was the cause. We can also then identify if it was isolated to the Oracle Service Cloud servers, network issues at a specific LinkedIn office, or only impacting a specific agent’s local machine.
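
    The triage reasoning above can be sketched roughly as follows. This is an illustrative heuristic, not the exact process we follow: the target labels, the 500 ms threshold, and the "more than half of the office" rule are all assumptions you would tune to your own baseline.

```python
from typing import Dict, List, Optional

# Hypothetical labels for the two Oracle Service Cloud targets.
OSC_TARGETS = {"osc-chat", "osc-production"}
SLOW_MS = 500.0  # assumed threshold for a "slow" response

Sample = Dict[str, Optional[float]]  # target label -> latency in ms (None = unreachable)


def is_slow(latency_ms: Optional[float]) -> bool:
    return latency_ms is None or latency_ms > SLOW_MS


def classify(agent_sample: Sample, office_samples: List[Sample]) -> str:
    """Guess where a slowdown originated.

    agent_sample is the complaining agent's log entry for the cycle;
    office_samples are entries from other agents in the same office
    for the same cycle.
    """
    slow = {target for target, ms in agent_sample.items() if is_slow(ms)}
    if not slow:
        return "no network issue logged"
    if slow <= OSC_TARGETS:
        # Only the Oracle Service Cloud hosts were slow.
        return "Oracle Service Cloud servers"
    # Other sites were slow too: did most colleagues see the same thing?
    peers_slow = [any(is_slow(ms) for ms in s.values()) for s in office_samples]
    if peers_slow and sum(peers_slow) > len(peers_slow) / 2:
        return "office network"
    return "agent's local machine or connection"
```

    For example, if an agent's entry shows every target slow while two colleagues in the same office logged normal times, the function points at the agent's own machine or connection rather than the office network.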

    Having consistent, high-level visibility into network performance saves my team countless hours of troubleshooting! Instead of working with IT and 15 other teams trying to gather and analyze data, we now have a good starting point and can identify the root cause significantly faster.

    While this customization was designed to help with troubleshooting, we’ve experienced many other benefits, including:

    • We understand how ping times vary across locations and have clearer expectations on performance across different channels (e.g. “How fast is chat in Bangalore?”). We can run performance benchmarking across locations.
    • We don’t have to ask users to send screenshots, run network traces, or try other browsers or applications. Ironically, we have had many people say, “No, everything else is running fine,” but the network performance monitor logs tell a different story!
    • We submit fewer service requests to Technical Support unless it’s an Oracle Service Cloud-specific issue.
    • We share network performance data from our logs with our local IT teams when there are issues, so they can pinpoint those specific timeframes in their own logs to see what is going on. We can use this data to drive network performance improvements in our support locations.
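
    As one example of the benchmarking mentioned above, a median per location and target is usually enough to answer a question like “How fast is chat in Bangalore?”. The sketch below assumes the log rows have already been exported as (location, target, latency) tuples; that row shape and the target labels are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, Optional[float]]  # (location, target, latency_ms)


def median_latency(rows: Iterable[Row]) -> Dict[Tuple[str, str], float]:
    """Median latency in ms for each (location, target) pair."""
    buckets = defaultdict(list)
    for location, target, ms in rows:
        if ms is not None:  # skip failed probes
            buckets[(location, target)].append(ms)
    return {key: median(values) for key, values in buckets.items()}
```

    The median is used here instead of the mean so that a single very slow cycle does not distort the benchmark for a location.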

    Since we can nail down the source of performance complaints, we’ve had fewer inaccurate reports of Oracle Service Cloud performance issues, and we have saved a significant amount of time on troubleshooting issues ultimately not related to Oracle Service Cloud! In short, this relatively simple Oracle Service Cloud customization has created huge value.

    Advice: Sharing A Few Tricks of the Trade

    If you’re interested in creating a Network Performance Monitoring Add-In for your organization, here are some things to keep in mind:

    • First, have a clear picture of what you’re doing and why. The specifics of our solution might not make sense for your organization.
    • Timer events are your friend. You can configure them to kick off every minute, five minutes, 15 minutes, etc.
    • Pick an Add-In that is always running, instead of a conditional Add-In (e.g. report Add-Ins can only run when the related report is open). I used a Navigation Section Add-In that loads as soon as someone logs in and continues to run the entire time in the background - the user doesn’t even know it’s there, and it has no performance impact if you (correctly) use threading / tasks.
    • The .NET framework allows you to set up a server config variable in OSC, which gives admins the ability to change the Add-In’s property values without requiring a developer. I used the server config variable to set our timer event interval, which allows the admin team to change the default value from 15 minutes to one, five, or 10 minutes, etc. This also enables the admin team to change the list of URLs the Add-In pings on the fly. They can make adjustments based on specific business scenarios without having to engage a developer to make these changes in the code.
    • Test, test, test! Verify your Add-In is working as you expected by plugging in sample data and watching it work. Manually verify that the data being collected matches your expectations in testing.
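
    The timer, threading and config tips above fit together roughly like this. Again, this is a Python sketch of the pattern rather than our C# Add-In code: the "interval;url1,url2" config format, parse_config and BackgroundMonitor are all hypothetical names invented for illustration.

```python
import threading
from typing import Callable, List, Tuple

DEFAULT_INTERVAL_MIN = 15


def parse_config(raw: str) -> Tuple[float, List[str]]:
    """Parse an assumed 'interval_minutes;url1,url2' admin setting."""
    interval_part, _, urls_part = raw.partition(";")
    try:
        interval = float(interval_part)
    except ValueError:
        interval = DEFAULT_INTERVAL_MIN
    if interval <= 0:
        interval = DEFAULT_INTERVAL_MIN
    urls = [u.strip() for u in urls_part.split(",") if u.strip()]
    return interval, urls


class BackgroundMonitor:
    """Run a job on a background thread at a fixed interval."""

    def __init__(self, interval_seconds: float, job: Callable[[], None]):
        self._interval = interval_seconds
        self._job = job
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                self._job()
            except Exception:
                pass  # a failed cycle should not end the monitoring loop
            # Sleep until the next cycle, waking immediately if stop() is called.
            self._stop.wait(self._interval)
```

    Because the job runs on its own thread, the agent's console stays responsive while a cycle is in progress, and an admin can change the interval or URL list by editing the setting rather than the code.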

     

    I hope this helps other organizations who may be struggling with similar issues and encourages you to take a step back when faced with common challenges and look for an entirely different solution. I’d love to hear your feedback on our solution or any other ways you’ve effectively dealt with this sort of challenge.
