Annotation Data Scientist at Apple
Location
Cambridge, United States of America
Compensation
$155k–$275k USD
Type
full time
Posted
Yesterday
As an Annotation Data Scientist on the Evaluation Integrity team, you will design and run HITL annotation projects that evaluate the quality and authenticity of agentic user personae, the validity of agent-to-agent conversations, and the reliability of LLM-as-judge and rule-based evaluators against Siri's product specifications. You will own annotation initiatives end to end: from rubric design and tooling, through annotator calibration, to data science analysis that turns annotator judgments into actionable signal for modeling, planning, and product teams.
Design HITL annotation tasks for agentic evaluation. Advise on rubrics and design workflows that ask annotators to assess (a) the quality and authenticity of user agent personae, (b) the validity of agent-to-agent conversations, and (c) whether agentic evaluators' verdicts align with Siri's product specifications and human interface guidelines.
Author, maintain, and iterate on annotation guidelines. Translate evolving Siri capabilities and product specs into clear, defensible rubrics for human grading aligned with agentic evaluators; run calibration sessions; monitor inter-annotator agreement; and refine guidelines based on edge cases surfaced during grading.
Manage multiple annotation programs in parallel. Plan, scope, and manage human evaluation tasks end-to-end — requirements gathering, annotator coordination, vendor management, timeline tracking, and stakeholder delivery.
Design custom annotation tooling in partnership with software engineers. Prototype task UIs, specify tool requirements, and collaborate with tooling engineers on the annotation platforms the Human Evaluation team relies on.
Apply data science rigor to human-labeled data. Use Python to build analysis pipelines that measure evaluator accuracy against the annotator pool, surface discrepancies between LLM-judge and rule-based evaluators, and quantify the reliability of each agentic evaluator as a source of truth.
Turn annotator feedback into evaluator improvements. Close the loop between annotators and the data scientists and software engineers who own user agents and automated evaluators, feeding findings back into prompts, rubrics, and product guidelines.
Contribute to the organization-wide eval health story. Partner with the User Feedback and Eval Science sub-team to ensure human signal is represented in the eval health report delivered to leadership.
Bachelor's or Master's degree in a quantitative or related field such as Data Science, Computer Science, Linguistics, Statistics, or Cognitive Science, or equivalent job-related experience.
3+ years of hands-on experience working with human-annotated datasets or human-in-the-loop evaluation methodologies for machine learning, natural language processing, or large language model systems.
3+ years of experience using Python for data processing, analysis, and prototyping, including experience with tools and libraries such as pandas, Jupyter, and at least one data visualization library.
Experience designing, implementing, and communicating annotation schemas, rubrics, or ontologies for machine learning training or evaluation data.
Experience managing multiple concurrent dataset curation efforts, including scoping work, iterating on guidelines, coordinating with in-house or vendor annotators, and monitoring annotator performance metrics such as accuracy, throughput, and inter-annotator agreement.
Experience specifying or designing custom annotation tooling in collaboration with software engineers.
Experience evaluating LLM-powered or agentic systems, including familiarity with LLM-as-judge methodologies, rubric-based grading, or trajectory and tool-call evaluation.
Familiarity with statistical methods that address accuracy and variability in human annotation data, such as inter-annotator agreement, Cohen's or Fleiss' kappa, Krippendorff's alpha, or bootstrapping.
Data-querying experience with SQL, Spark, or similar, and comfort working with large, complex, real-world datasets.
Experience building pre-ship evaluation pipelines for conversational or assistant products.
Experience with prompt engineering, or with designing simulated user personae for agent evaluation.
Experience running annotation programs across multiple locales or at large scale.
Excellent written and verbal communication skills, with the ability to explain technical topics clearly to data scientists, engineers, annotators, and cross-functional partners.
Proven ability to collaborate effectively across functions and drive projects of varying sizes and scopes — knowing when to dive deep and when to delegate.
Play a part in the ongoing revolution in human-computer interaction. Siri is evolving — and the way we evaluate it has to evolve with it. Join the Evaluation Integrity team to help build the trusted quality signal behind every Siri release.
Within the Siri evaluation organization, the Human Evaluation sub-team is responsible for answering the question: can we trust our evals? We do that by designing human-in-the-loop (HITL) annotation tasks that scrutinize every moving part of an agentic evaluation — the simulated user agent, the conversation it has with Siri, and the automated evaluators that grade the exchange. This role sits at the intersection of data science, human annotation engineering, and evaluation methodology, and is instrumental in turning human judgment into a rigorous, reproducible signal that directly informs pre-ship model and product decisions.
Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant
At Apple, we believe accessibility is a fundamental human right. You’ll find that idea reflected in everything here — in our culture, our benefits and our digital tools. By welcoming as many perspectives as possible, we help you build a career where you feel like you belong.
Learn about accessibility in Apple’s workplace
Learn about reasonable accommodations for job applicants
Apple accepts applications to this posting on an ongoing basis.