Our company is looking for a Services Operations Center (SOC) Operator to join our growing team!
This is a great opportunity to be part of one of the fastest-growing infrastructure companies in history, an organization at the center of the hurricane created by the revolution in artificial intelligence.
Our company is the data platform company for the AI era.
We are building the enterprise software infrastructure to capture, catalog, refine, enrich, and protect massive datasets and make them available for real-time data analysis and AI training and inference.
Designed from the ground up to make AI simple to deploy and manage, our company takes the cost and complexity out of deploying enterprise and AI infrastructure across data center, edge, and cloud.
Our success has been built through intense innovation, a customer-first mentality and a team of fearless people who leverage their skills & experiences to make real market impact.
This is an opportunity to be a key contributor at a pivotal time in our company's growth and at a pivotal point in computing history.
Overview:
The Services Operations Center (SOC) is a team within Customer Success whose function is to monitor the quality of service of the company's data clusters deployed in the field and to take the necessary actions in the case of service degradation or outage.
A person who works in the SOC is a SOC Operator.
This person will be looking at a dashboard of monitored items, and when green lights turn red, they will click on the red light, read a runbook, and endeavor to resolve the issue.
If they can fix the problem within 30 minutes, they will.
If they cannot fix the problem, they will declare an incident, page in Support to begin troubleshooting, and start keeping a timeline of events.
If Support can’t fix the problem, the SOC Operator will page in R&D.
The SOC Operator will ‘run’ the incident until the issue is resolved.
Once resolution occurs, the SOC Operator will close out the ticket and publish a Preliminary Findings Report in the ticket.
All the while, the SOC Operator will provide a play-by-play in the internal Slack channel to keep everyone aware of what’s happening.
Job Summary:
As a SOC Operator, you will be responsible for monitoring and maintaining the health and performance of our fleet of installed clusters.
You will work in a 24/7 operations environment, ensuring the availability, reliability, and security of services.
This role involves real-time monitoring, incident detection, incident management, incident resolution, and clear written and verbal communication with other teams and stakeholders.
Responsibilities:
Monitor clusters using internal monitoring tools to detect and troubleshoot issues promptly.
Respond to alerts and incidents in a timely manner, following standard operating procedures (SOPs) and escalation processes.
Perform initial investigation and diagnosis of problems, escalating complex issues to Support or R&D.
Document incidents, including their details, troubleshooting steps, and resolutions in the incident tracking system.
Collaborate with other teams, including Support, R&D, Account teams, and customers to ensure effective incident resolution and communication.
Conduct routine checks and audits to identify potential problems or vulnerabilities.
Assist with the implementation of changes and updates to the infrastructure as directed by team leads.
Requirements:
High school diploma or equivalent; a degree or certification in information technology or a related field is a plus.
Proven experience as a SOC Operator or in a similar network monitoring role is preferred.
Strong understanding of networking concepts, protocols, and technologies (TCP/IP, SNMP, DHCP, DNS, etc.).
Ability to work independently and collaboratively in a team-based environment.
Excellent problem-solving and analytical skills, with the ability to multitask effectively.
This position is open to all candidates.