Site Reliability Engineer

Singapore, Singapore | Full-time

Tokka Labs | Singapore, SG

Tokka Labs is a proprietary trading firm with a focus on close collaboration, rigorous research, and cutting-edge technology. We are market makers, searchers, and solvers for top protocols on the most popular blockchains in the world. We design and implement our own trading systems and strategies to provide liquidity in the most diverse and challenging environments. At the core of it all lies our unwavering commitment to pushing the boundaries of decentralized finance, and we are always on the lookout for like-minded individuals to join us on this journey. If you think you have what it takes, apply now!

 

Position Summary

As a Site Reliability Engineer (SRE), you will play a crucial role in maintaining and enhancing the security, stability, scalability, and cost-effectiveness of our systems. You will leverage your expertise in tools like Terraform, Ansible, Kubernetes, and AWS, as well as your networking skills, to build and manage a robust infrastructure.

 

Job Responsibilities

  • System Monitoring and Incident Response:
    • Continuously monitor the performance, availability, and security of systems.
    • Respond quickly to incidents, conduct root cause analysis, and implement solutions to prevent recurrence.
  • Infrastructure Automation:
    • Automate infrastructure deployment and management using Terraform, Ansible, and related tools.
    • Optimize cloud environments, particularly AWS, to ensure efficient resource use and cost control.
  • Kubernetes and Container Management:
    • Manage containerized applications using Kubernetes, ensuring high availability and scalability.
    • Develop and implement strategies for effective container orchestration and management.
  • Security and Compliance:
    • Implement and maintain security best practices across the infrastructure.
    • Conduct regular security audits and vulnerability assessments to protect against potential threats.
  • Network Management:
    • Design, implement, and manage network infrastructure to support system stability and performance.
    • Troubleshoot and resolve network-related issues, ensuring minimal downtime.
  • Capacity Planning and Performance Optimization:
    • Plan for future infrastructure needs, ensuring the system scales efficiently.
    • Continuously analyze system performance and apply improvements for better stability and cost efficiency.
  • Collaboration and Knowledge Sharing:
    • Work closely with software development, DevOps, and IT teams to align infrastructure strategies with business needs.
    • Document processes, share knowledge with team members, and mentor junior engineers.

 

Job Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 3+ years of experience in Site Reliability Engineering, DevOps, or a related role.
  • Proven experience with Terraform, Ansible, Kubernetes, and AWS.
  • Strong networking skills and experience with cloud networking.
  • Proficiency in scripting and automation (e.g., Python, Bash).
  • In-depth knowledge of Unix/Linux systems.
  • Strong analytical skills for performance tuning and cost optimization.
  • Excellent communication and collaboration skills.

Preferred Qualifications

  • Experience with multi-cloud environments.
  • Familiarity with database management and data security.
  • Knowledge of CI/CD pipelines and automation tools.