DATAVERSITY hosted its inaugural Data Architecture Summit, an event designed for CDOs, CIOs and data architects, last week in Chicago. Overall, the event was quite impactful, with high-quality presentations and case studies that resonated with the approximately 150 attendees. The leading theme and ongoing discussions centered on how to tackle data governance in a business world of ever-expanding data sources, and on the opportunity to leverage AI and machine learning for assistance. A second overarching theme explored the role of data lakes and their mainstream acceptance within modern data architectures. Perhaps surprisingly, a significant number of sessions covered approaches to data governance and metadata that rely on graph databases, specifically knowledge graphs and property graphs.
Nearly 80 percent of attendees raised their hands at the summit panel discussion when asked if data governance was among their organization's top 2018 initiatives for data architecture. The upcoming GDPR deadline and its serious impacts are likely significant contributors to this overall sense of urgency. One attendee believed that the GDPR penalties would be significant enough to trigger a global bear market. Beyond the GDPR deadline, the trend toward enterprise self-service, which enables businesspeople to work with data, also drives new approaches to data governance and master data management. Governance has become a race in some ways, striving to ensure protections and proper usage as companies continually add access to more and more data sets from external sources and their own data lakes. The key to data governance going forward will be delivering data to users quickly while maintaining an appropriate level of protection. Several sessions and attendee discussions covered data classification as a “divide and conquer” approach to governing data when and where it’s needed. Attendees revealed that methods for setting up zones in the data lake vary among companies. Some classified data vertically across business units or functions, while others classified data horizontally, from raw data to data that is curated and certified for use. Either way, speakers and attendees agreed that automation and AI will be required to keep up with the amount of data flowing throughout the enterprise.
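The two zoning approaches attendees described can be pictured as two independent labels on each data set. The following is only a minimal sketch of that idea, not something presented at the summit: the `Zone` levels, the business unit names, the `DataSet` class and the `accessible` helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Zone(IntEnum):
    """Horizontal classification: how refined the data is (hypothetical levels)."""
    RAW = 0
    CURATED = 1
    CERTIFIED = 2


@dataclass(frozen=True)
class DataSet:
    name: str
    business_unit: str  # vertical classification (hypothetical values)
    zone: Zone          # horizontal classification


def accessible(datasets, business_unit, min_zone):
    """Return names of data sets owned by `business_unit`
    that sit at or above `min_zone`."""
    return [
        d.name
        for d in datasets
        if d.business_unit == business_unit and d.zone >= min_zone
    ]


# Hypothetical catalog entries, for illustration only.
catalog = [
    DataSet("web_clicks", "marketing", Zone.RAW),
    DataSet("campaign_results", "marketing", Zone.CERTIFIED),
    DataSet("ledger", "finance", Zone.CURATED),
]

print(accessible(catalog, "marketing", Zone.CURATED))
```

Keeping the vertical and horizontal labels separate means a governance rule can filter on either one, or on both, without the two schemes having to agree on a single hierarchy.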
Data lakes took the second spotlight in discussions and sessions, with the data architectures presented consistently organizing the data lake into zones or layers. Many of the data lakes being discussed were on-premises and Hadoop-based, supported by Cloudera and Hortonworks. As discussions moved to cloud-based data lakes, speakers and attendees identified challenges regarding architecting for data security, public cloud regions with ingress/egress data flow and coordinating multi-cloud architectures. Fortunately, there weren’t too many data swamp naysayer comments; we seem to have moved on and can now productively discuss lessons learned. In a session on “Transparent Data Lakes as a starting point for AI,” PwC Researcher Alan Morrison summed it up well by reminding attendees that data lakes often fail because they lack a business objective and a plan for how people will leverage the data lake. Indeed, a subtle but very important point regarding early data lake strategy and maturity is to start small with a business objective, working on challenges like data ingestion, security, governance and user access at a small scale before expanding the data lake as part of an architecture journey. Many sessions did a good job of going beyond the architectural organization of data lake zones and explaining how metadata, data catalogs or semantics are critical to working with the data zones for everyone from casual business users to data scientists. That said, many people shared, based on experience, that few data lakes actually make it into production environments.
Finally, the spotlight on how knowledge graphs are becoming an effective way to catalog and work with data context and semantics came as a bit of a surprise at the architecture summit. Beginning with a graph database tutorial by respected industry veteran David Loshin, sessions explored various approaches to leveraging graph databases and triple store databases for their flexibility in handling data context and relationships. This is an extension of the conversations around enterprise semantics and ontology that have become more common as companies find ways to tackle their growing amount of data with a semantic context layer for users. This realization was quite clear, as attendees often stated that all data is practically useless unless we find ways to understand and leverage it properly. A number of sessions and discussions also emphasized metadata as a new use case for the property graph, where context and relationships are captured in the graph, allowing users to search and understand massive amounts of data in the data lake. Graph data management and ontology management became frequent follow-on discussions during networking and roundtable lunches. The Hadoop platform, with Spark for GraphX and MLlib, was consistently confirmed as the way that many companies were tackling their data lake, governance and knowledge graph needs.
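To make the property graph idea concrete, here is a small sketch of metadata held as nodes with properties and labeled edges between them. It is an illustration under stated assumptions rather than any speaker's implementation: the `PropertyGraph` class, the `lineage` helper, the `DERIVED_FROM` and `HAS_COLUMN` edge labels and the data set names are all hypothetical, and a real deployment would use a graph database rather than in-memory dictionaries.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph (hypothetical, for illustration)."""

    def __init__(self):
        self.nodes = {}                 # node id -> dict of properties
        self.edges = defaultdict(list)  # source id -> [(label, target id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target):
        self.edges[source].append((label, target))

    def neighbors(self, source, label):
        """Targets reachable from `source` over edges with this label."""
        return [t for (lbl, t) in self.edges[source] if lbl == label]

    def find(self, **props):
        """Sorted ids of nodes whose properties match every keyword given."""
        return sorted(
            node_id
            for node_id, p in self.nodes.items()
            if all(p.get(k) == v for k, v in props.items())
        )


def lineage(graph, dataset):
    """Follow DERIVED_FROM edges to collect every upstream data set."""
    seen, stack = [], [dataset]
    while stack:
        current = stack.pop()
        for parent in graph.neighbors(current, "DERIVED_FROM"):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


# Hypothetical metadata for two data sets and one column.
g = PropertyGraph()
g.add_node("customers_raw", kind="dataset", zone="raw")
g.add_node("customers_certified", kind="dataset", zone="certified")
g.add_node("email", kind="column", sensitivity="personal")
g.add_edge("customers_certified", "DERIVED_FROM", "customers_raw")
g.add_edge("customers_raw", "HAS_COLUMN", "email")

print(g.find(kind="dataset"))
print(lineage(g, "customers_certified"))
```

Because relationships are stored as data rather than as fixed schema, a new kind of context (an owner, a glossary term, a policy) can be added as another node or edge label without restructuring what is already there.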
Through the impactful content of the sessions and discussions, attendees came away with deeper knowledge of best practices and new techniques for data architecture technologies and approaches. With such a great turnout at this inaugural event, I can only imagine that the Data Architecture Summit will continue to grow in 2018.
John O’Brien is Principal Advisor and CEO of Radiant Advisors. A recognized thought leader in data strategy and analytics, John’s unique perspective comes from the combination of his roles as a practitioner, consultant and vendor CTO.