Site Reliability Engineer
2 weeks ago
At AccelByte, our mission is to empower game creators by providing them with the backend platform and tools required to make scalable, reliable, AAA-quality games. The company was founded in 2016 by industry veterans with decades of experience engineering online systems for some of the largest game and distribution platforms in the world, including Fortnite, Epic Store, Xbox Live, PlayStation Network, and EA Origin. We are backed by top investors including SoftBank, Sony Interactive Entertainment, Galaxy Interactive, NetEase, and Krafton. Our latest Series B funding has solidified our place as a top player in the gaming industry.
We believe that the best companies empower employees to make decisions, obsess about the best user experience, and are not afraid to make and learn from their mistakes. Our culture is based on humility, openness to feedback, drive, and collaboration, which we feel results in the best-performing teams. As a company that values diversity, inclusion, and employee growth, we give our employees opportunities to work with and learn from teams all over the world. We offer competitive salaries, a full range of health benefits, social activities, career growth opportunities, and an amazing team. Come join us!
**Position Summary**
As an SRE/Cloud Engineer, your primary responsibility is to enhance the observability of our infrastructure. You play an important role in optimizing resources and driving initiatives that keep infrastructure management aligned with business objectives. Your focus is on implementing tools and practices that enable comprehensive monitoring, logging, and tracing of system components and processes, which in turn improves system reliability, troubleshooting efficiency, and overall operational transparency.
**Essential Functions/Responsibilities**
The SRE/Cloud Engineer is accountable for the following functions and responsibilities:
- Configure and maintain monitoring tools (Prometheus, Grafana, AWS CloudWatch) for real-time visibility into system performance and health.
- Enhance observability strategies and tools to monitor the performance, availability, and reliability of distributed systems.
- Maintain robust monitoring and alerting solutions for timely issue detection and resolution.
- Promote best practices in observability, including logging, tracing, and metrics collection within development teams.
- Utilize Kubernetes (K8s) for container orchestration, scalability, reliability, and efficient resource utilization.
- Assist in performance analysis, capacity planning, and optimizing system performance and resource utilization.
- Identify and address bottlenecks, inefficiencies, and potential failure points in the system.
- Assist in creating and enforcing cost control measures, monitoring AWS resource utilization, and identifying optimization opportunities to decrease infrastructure costs.
- Implement containerization strategies to improve deployment efficiency and resource utilization in the AWS environment.
- Contribute to the analysis of cloud resource usage patterns and identify opportunities for cost optimization.
- Perform other duties as assigned.
**Qualifications/Experience Required**
- Bachelor's degree, or relevant work experience, certifications, or coursework.
- At least 3 years of experience in Site Reliability Engineering (SRE) or a similar role, with a particular focus on improving observability within distributed systems.
- Experience in designing and implementing log collection, aggregation, and visualization systems using Fluentd, Fluent Bit, Promtail, Loki & LogQL, Logstash, OpenSearch, and Amazon Athena.
- Experience in designing and implementing metric collection, aggregation, and visualization solutions using technologies like Prometheus & PromQL, Grafana, cAdvisor, metrics-server, and CloudWatch.
- Practical knowledge of trace collection, aggregation, and visualization using tools and techniques such as Grafana Tempo & TraceQL, tail sampling, and OpenTelemetry.
- Basic experience in Kubernetes, including using kubectl, Flux, and other tools to debug and modify cluster state, and an understanding of the limitations and usage of containerization technology within a Kubernetes cluster.
- Basic experience in using Infrastructure-as-Code (IaC) tools (e.g., Terraform, CloudFormation) for provisioning and configuration management, including the ability to apply, modify, or delete modules and create custom Terraform modules.
- Basic experience in performing cloud system operations on AWS infrastructure, including backups, snapshots, and other administrative tasks.
- Practical knowledge of defining budgets, forecasting expenses, and building automated tools to identify cost trends and anomalies for cloud infrastructure.
- Understanding of distributed systems.