WebScalding: A Framework for Big Data Web Services


CareerBuilder (CB) manages over 50 million active résumés and 2 million active job postings, and it must continuously match job seekers with the most relevant jobs and employers with the most qualified candidates. Doing this at scale poses significant Big Data challenges.

To address these, we developed WebScalding, a high-level Big Data framework built upon and extending Twitter’s Scalding framework to meet CB’s unique large-scale data processing requirements. WebScalding raises the level of abstraction by:

  1. Exposing all internal web services as Cascading pipe operations,

  2. Allowing these operations to seamlessly read from common data sources to build pipe assemblies, and

  3. Enabling execution of these assemblies both on CB’s Hadoop cluster and on local machines without modification.
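The three points above can be sketched in a few lines. The following is a minimal illustration in plain Python, not WebScalding, Scalding, or Cascading code: every name in it (`service_op`, `filter_op`, `Assembly`, `local_runner`, `partitioned_runner`, `fake_skill_tagger`) is hypothetical, the "web service" is a local stand-in function, and the "cluster" is simulated by splitting the input into partitions. It only shows the shape of the idea: a service call wrapped as a pipe operation, operations chained into an assembly over a data source, and the same assembly handed unchanged to two different runners.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Operation = Callable[[Iterator[Record]], Iterator[Record]]


def service_op(service: Callable, in_field: str, out_field: str) -> Operation:
    """Wrap a per-record service call as a pipe operation (hypothetical helper)."""
    def op(records: Iterator[Record]) -> Iterator[Record]:
        for rec in records:
            # Copy the record and attach the service result under out_field.
            yield {**rec, out_field: service(rec[in_field])}
    return op


def filter_op(predicate: Callable[[Record], bool]) -> Operation:
    """Keep only the records for which predicate returns True."""
    def op(records: Iterator[Record]) -> Iterator[Record]:
        return (rec for rec in records if predicate(rec))
    return op


def fake_skill_tagger(text: str) -> List[str]:
    # Stand-in for a real web-service call; an assumption for illustration only.
    known = ("python", "scala", "hadoop")
    return [skill for skill in known if skill in text.lower()]


class Assembly:
    """An ordered chain of operations over a data source."""

    def __init__(self, source: Iterable[Record]):
        self.source = source
        self.ops: List[Operation] = []

    def then(self, op: Operation) -> "Assembly":
        self.ops.append(op)
        return self

    def run(self, runner: Callable) -> List[Record]:
        # The assembly itself does not know where it is executed.
        return runner(self.source, self.ops)


def local_runner(source: Iterable[Record], ops: List[Operation]) -> List[Record]:
    """Run every operation in order in a single process."""
    records: Iterator[Record] = iter(source)
    for op in ops:
        records = op(records)
    return list(records)


def partitioned_runner(n_parts: int) -> Callable:
    """Simulate a cluster: split the input into contiguous partitions,
    run the same operations on each, and concatenate the results."""
    def runner(source: Iterable[Record], ops: List[Operation]) -> List[Record]:
        rows = list(source)
        size = max(1, -(-len(rows) // n_parts))  # ceiling division
        out: List[Record] = []
        for start in range(0, len(rows), size):
            out.extend(local_runner(rows[start:start + size], ops))
        return out
    return runner


resumes = [
    {"id": 1, "text": "Scala and Hadoop engineer"},
    {"id": 2, "text": "Python data scientist"},
    {"id": 3, "text": "Sales manager"},
]

assembly = (
    Assembly(resumes)
    .then(service_op(fake_skill_tagger, "text", "skills"))
    .then(filter_op(lambda rec: len(rec["skills"]) > 0))
)

local_result = assembly.run(local_runner)
simulated_cluster_result = assembly.run(partitioned_runner(2))
```

Because the operations here work record by record, the single-process run and the partitioned run produce the same output. Operations that aggregate across records, such as grouping, would need a shuffle step that this sketch deliberately leaves out.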

We illustrate WebScalding’s capabilities through three internal case studies, which show how data scientists, even those without deep expertise in Big Data frameworks, can efficiently design, implement, and test scalable data applications. Performance comparisons further show that WebScalding programs achieve super-linear speedups relative to equivalent sequential Python implementations.
