This week’s spotlight will be all about software integrated with storage services. GFAPI has opened the floodgates for this type of integration with GlusterFS. In this spotlight, we’ll hear from people who have been actively working on integrations with Apache CloudStack, Pydio, and OpenNebula.
Hear about how they integrated with GlusterFS, and what they would suggest to others who wish to deploy an application stack with scale-out storage requirements.
As usual, you can request to be part of the live hangout, or follow along on YouTube. Q&A will be managed from the IRC channel #gluster-meeting.
Deploying Pydio in a highly-demanding environment (lots of users, tons of documents) to achieve a dropbox-like server at scale requires a solid and elastic architecture.
As a distributed file system and software-defined storage, GlusterFS is a low-cost way of providing a robust storage architecture on standard hardware. For its part, Pydio has kept the FileSystem driver at its core since the beginning of the project, making it a perfect match to deploy on top of Gluster to provide user-friendly features and enterprise-grade security.
The principle here is to provide High Availability and Scalability combining GlusterFS (for the storage part) and Pydio (for the access part) through a load-balanced cluster of nodes.
We chose here to install Pydio (= compute) and the Gluster bricks (= storage) on the same instances, but other configurations can be imagined: N dedicated nodes for storage with a subset of them running Pydio, or none of them running Pydio and K separate compute nodes, etc.
Also, we chose to set up two Gluster volumes (each of them assembling 2p bricks) for easier maintenance: one will contain the shared Pydio configurations, allowing a new Pydio node to start up without hassle, and one will contain the actual user data (files). On EC2, we will use EBS volumes as the primary bricks for the data Gluster volume, and the instances' available disk space for the configs Gluster bricks. Finally, a DB must be set up to hold all the ancillary Pydio data (namely users and ACLs, event logs, etc.). This DB can run on another instance, or possibly be installed on one of the nodes. It should be replicated and backed up for a better failover scenario.
The following diagram shows an overview of the targeted architecture.
Create two (or four) EC2 instances, attaching to each an EBS volume of X Gb, depending on the size you require. We chose Ubuntu 12.04 as the OS. Make sure to use a fairly open security group; we’ll restrict permissions later. Instances will start with both PRIVATE and PUBLIC IPs/DNS names. Update the apt package lists:
$ sudo apt-get update
Prepare Gluster bricks
We’ll use one for the actual data, and one for the Pydio configuration data.
$ sudo apt-get install glusterfs-server xfsprogs
$ sudo mkfs.xfs /dev/xvdb
$ sudo mkdir /mnt/ebs
$ sudo mount /dev/xvdb /mnt/ebs
Then add the following line to /etc/fstab so the volume mounts automatically at startup:
/dev/xvdb /mnt/ebs xfs defaults 0 0
Let’s also create a dedicated folder for the configs volume, on both nodes
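With the bricks in place on each node, the two replicated volumes can be created from any one node. A minimal sketch, assuming two nodes with hypothetical hostnames node1 and node2, and hypothetical brick paths /mnt/ebs/pydio-data (EBS-backed) and /var/gluster/pydio-config (instance disk):

```shell
# Run once, from node1. Hostnames, volume names and brick paths are
# assumptions; adjust them to your own instances.
sudo gluster peer probe node2

# Data volume: one EBS-backed brick per node, replicated across both.
sudo gluster volume create pydio-data replica 2 \
    node1:/mnt/ebs/pydio-data node2:/mnt/ebs/pydio-data
sudo gluster volume start pydio-data

# Configs volume: one brick per node on the instance disk, replicated.
sudo gluster volume create pydio-config replica 2 \
    node1:/var/gluster/pydio-config node2:/var/gluster/pydio-config
sudo gluster volume start pydio-config
```

Replicating both volumes across the two nodes is what lets any single node fail without losing either the files or the shared configuration.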
Once finished, start or restart Apache:
$ apachectl start
Go to the public IP of the node in a web browser, http://PUBLIC_IP1/pydio/, and follow the standard installation process. Set up the admin login and global options, and for the Configurations Storage, choose Database > Mysql, using the public IP of the DB node as the server host.
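The configuration database has to exist before the Pydio installer can connect to it. A minimal sketch to run on the DB node; the database name, user and password here are placeholders, not values the tutorial prescribes:

```shell
# Create a database and a user for Pydio (names/password are assumptions).
# The '%' host allows connections from the Pydio nodes; restrict it further
# via your security group rules.
mysql -u root -p <<'SQL'
CREATE DATABASE pydio;
CREATE USER 'pydio'@'%' IDENTIFIED BY 'change-me';
GRANT ALL PRIVILEGES ON pydio.* TO 'pydio'@'%';
FLUSH PRIVILEGES;
SQL
```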
Then save and connect as admin, switch to the "Settings" workspace, and customize the configuration as you like. You can activate some additional plugins, customize the logo and application title, etc. The interesting part of doing that now is that any changes will be automatically propagated to the other nodes you switch on.
As they will share their base configuration through the gluster pydio-config volume, the next nodes will directly inherit the first node’s configs. So to fire up a new node, all you will have to do is the scripted part:
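Because the configs live on the shared volume, the per-node script boils down to installing the stack, mounting the two Gluster volumes, and starting Apache. A sketch, where the volume names (pydio-data, pydio-config), the server hostname node1, and the mount points are all assumptions:

```shell
# Install the Gluster client and the web stack (Ubuntu 12.04 package names).
sudo apt-get install -y glusterfs-client apache2 php5

# Mount the shared volumes; any Gluster node can serve as the mount source,
# since the client then talks to all bricks directly.
sudo mkdir -p /mnt/pydio-data /mnt/pydio-config
sudo mount -t glusterfs node1:/pydio-data /mnt/pydio-data
sudo mount -t glusterfs node1:/pydio-config /mnt/pydio-config

sudo apachectl start
```

Adding the two mounts to /etc/fstab, as was done for the bricks, makes them survive a reboot.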
Then verify that pydio is up and running, and that you can log in with the same credentials, at http://PUBLIC_IP2/pydio/
We could use a custom compute node equipped with HAProxy or similar software, but as our tutorial is running on AWS, we will use the available service for that: the LoadBalancer. In your AWS console, create a LoadBalancer, forwarding port 80 to instance port 80.
When configuring how the health check is performed (i.e. how the LB verifies that instances are alive), make sure to change the name of the checked file to check.txt. This is important because, thanks to our install scripts, the nodes’ Apache servers are configured to skip logging calls to this file, to avoid filling the logs with useless data (the check happens every 5s).
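One way to keep those probes out of the access log is an Apache SetEnvIf rule. A sketch of the idea; the conf file path is an assumption, and the install scripts mentioned above may already do the equivalent:

```shell
# Tag requests for the health-check file, then exclude tagged requests
# from the access log (Apache 2.2 on Ubuntu 12.04 reads conf.d/*.conf).
sudo tee /etc/apache2/conf.d/skip-healthcheck.conf <<'EOF'
SetEnvIf Request_URI "^/check\.txt$" dontlog
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!dontlog
EOF
sudo apachectl graceful
```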
NOTE If you have an SSL certificate, which is definitely a good security rule, you will install it on this LoadBalancer, and redirect port 443 to 80: internal communications do not need to be encrypted.
Once the LoadBalancer is created, edit the "Stickiness" parameter of the redirection rules and choose "Enable Application Generated Cookie Stickiness", using "AjaXplorer" as the cookie name. This is important: although clients will be randomly redirected to instances on first connection, once a session is established it will always stay on a given instance.
NOTE Session stickiness saves us from setting up a session-sharing mechanism between the nodes, though this could be done, for example, by adding a memcache server.
Outside world address
Now that our various nodes will be accessed through a proxy and not through their "natural" public IPs, we need to inform Pydio of that. This is necessary to generate correct sharing URLs and to send emails pointing to the correct URL. Without it, Pydio would try to auto-detect the IP and would probably end up displaying the PRIVATE IP of the current working node.
Log in as admin to Pydio, and go to Settings > Global Configurations > Pydio Main Options. Here, update the Server URL and Download URL fields with the real addresses, and save. Go to a file workspace, try to share a file or a folder, and verify the link is correct and working.
Conclusion: adding new nodes on-demand
Well, that’s pretty much it. We could refine this architecture on many points, but basically you’re good to go.
So what do you do to add a new node? Basically, you’ll have to:
[if you need more storage]
Fire up a new instance with the Ubuntu OS
Configure Gluster to add it as a new brick to the volume
[if you need more compute]
Fire up a new instance with the Ubuntu OS
Configure the Gluster client to mount the volumes, and run the node setup script
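Either path can be sketched in a few commands. The hostnames, volume names and mount points below are assumptions, not values fixed by this tutorial:

```shell
# More storage: probe the new node(s) and grow the data volume.
# With "replica 2", bricks must be added in pairs, hence two new nodes here.
sudo gluster peer probe node3
sudo gluster peer probe node4
sudo gluster volume add-brick pydio-data replica 2 \
    node3:/mnt/ebs/pydio-data node4:/mnt/ebs/pydio-data
sudo gluster volume rebalance pydio-data start

# More compute: mount the shared volumes and start the web stack.
sudo mount -t glusterfs node1:/pydio-data /mnt/pydio-data
sudo mount -t glusterfs node1:/pydio-config /mnt/pydio-config
sudo apachectl start
```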
“Take developers through a tour of existing DiskFile backends for OpenStack Object Storage (Swift). The DiskFile interface in Swift is an API for changing how objects are stored on storage volumes. Swift provides a default implementation over XFS (Posix) and a reference in-memory example to help folks get started.”
“This presentation introduces Manila, the new OpenStack File Shares Service. Manila is a community-driven project that presents the management of file shares (e.g. NFS, CIFS) as a core service to OpenStack. Manila currently works with NetApp, Red Hat Storage (GlusterFS) and IBM GPFS (along with a reference implementation based on a Linux NFS server).”
“The main focus of this session will be to explain how Docker can be leveraged to utilize unused cycles on GlusterFS Storage nodes for additional compute nodes in an Openstack environment. Docker is an application container and can host both GlusterFS Storage node as well as Openstack compute nodes in a single physical server.”
“There is a need to extend GlusterFS storage availability to other Operating Systems and Hyper-visors. In this session, you will learn about a generalized block solution for Gluster that works for any block-based application (Xen, HyperV, VirtualBox, VmWare, tape). We will compare different interconnect choices between the GlusterFS server and openstack client, such as iSCSI, FcOE, and ‘gluster native’.”
“Red Hat uses OpenStack Swift as the object storage interface to GlusterFS. Instead of reimplementing the Swift API, Red Hat is participating in the OpenStack Swift community to ensure that GlusterFS can take full advantage of the latest Swift features. This is absolutely the right way to pair Swift with another storage system.”
I was pleased to read about the progress of Graylog2, ElasticSearch, Kibana, et al. in the past year. Machine data analysis has been a growing area of interest for some time now, as traditional monitoring and systems management tools aren’t capable of keeping up with All of the Things that make up many modern workloads. And then there are the more general purpose, “big data” platforms like Hadoop along with the new in-memory upstarts sprouting up around the BDAS stack. Right now is a great time to be a data analytics person, because there has never in the history of computing been a richer set of open source tools to work with.
There’s a functional difference between what I call data processing platforms, such as Hadoop and BDAS, and data search presentation layers, such as what you find with the ELK stack (ElasticSearch, Logstash and Kibana). While Hadoop, BDAS, et al. are great for processing extremely large data sets, they’re mostly useful as platforms for people Who Know What They’re Doing (TM), i.e. math and science PhDs and analytics groups within larger companies. But really, the search and presentation layers are, to me, where the interesting work is taking place these days: it’s where Joe and Jane programmer and operations person are going to make their mark on their organization. And many of the modern tools for data presentation can take data sets from a number of sources: log data, JSON, various forms of XML, event data piped directly over sockets or some other forwarding mechanism. This is why there’s a burgeoning market around tools that integrate with Hadoop and other platforms.
There’s one aspect of data search presentation layers that has largely gone unmentioned. Everyone tends to focus on the software, and if it’s open source, that gets a strong mention. No one, however, seems to focus on the parts that are most important: data formats, data availability and data reuse. The best part about open source analytics tools is that, by definition, the data outputs must also be openly defined and available for consumption by other tools and platforms. This is in stark contrast to traditional systems management tools and even some modern ones. The most exciting premise of open source tooling in this area is the freedom from the dreaded data roach motel model, where data goes in, but it doesn’t come out unless you pay for the privilege of accessing the data you already own. Recently, I’ve taken to calling it the “skunky data model” and imploring people to “de-skunk their data.”
Last year, the Red Hat Storage folks came up with the tag line of “Liberate Your Information.” Yes, I know, it sounds hokey and like marketing double-speak, but the concept is very real. There are, today, many users, developers and customers trapped in the data roach motel and cannot get out, because they made the (poor) decision to go with a vendor that didn’t have their needs in mind. It would seem that the best way to prevent this outcome is to go with an open source solution, because again, by definition, it is impossible to create an open source solution that creates proprietary data – because the source is open to the world, it would be impossible to hide how the data is indexed, managed, and stored.
In the past, one of the problems was that there simply weren’t a whole lot of choices for would-be customers. Luckily, we now have a wealth of options to choose from. As always, I recommend that those looking for solutions in this area go with a vendor that has their interests at heart. Go with a vendor that will allow you to access your data on your terms. Go with a vendor that gives you the means to fire them if they’re not a good partner for you. I think it’s no exaggeration to say that the only way to guarantee this freedom is to go with an open source solution.
Join us on Friday, February 7, 1pm EST/10am PST/18:00 GMT for a very special Gluster Spotlight featuring our 4 new board members: James Cuff (Harvard FASRC), Mark Hinkle (Citrix), Anand Avati (Red Hat – individual contributor) and Theron Conrey (individual contributor).
As always, you can watch the video feed here and ask questions on the #gluster-meeting IRC channel on the Freenode network. See you there!
Citrix, Harvard University FASRC and long-time contributors join the Gluster Community Board to drive the direction of open software-defined storage
February 5, 2014 – The Gluster Community, the leading community for open software-defined storage, announced today two new organizations have signed letters of intent to join: Citrix, Inc. and Harvard University’s Faculty of Arts and Science Research Computing (FASRC) group. This marks the third major expansion of the Gluster Community in governance and projects since mid-2013. Monthly downloads of GlusterFS have tripled since the beginning of 2013, and traffic to gluster.org has increased by over 50% over the previous year. There are now 45 projects on the Gluster Forge and more than 200 developers, with integrations either completed or in the works for Swift, CloudStack, OpenStack Cinder, Ganeti, Archipelago, Xen, QEMU/KVM, Ganesha, the Java platform, and SAMBA, with more to come in 2014.
Citrix and FASRC will be represented by Mark Hinkle, Senior Director of Open Source Solutions, and James Cuff, Assistant Dean for Research Computing, respectively, joining two individual contributors: Anand Avati, Lead GlusterFS Architect, and Theron Conrey, a contributing speaker, blogger and leading advocate for converged infrastructure. Rounding out the Gluster Community Board are Xavier Hernandez (DataLab); Marc Holmes (Hortonworks), Vin Sharma (Intel), Jim Zemlin (The Linux Foundation), Keisuke Takahashi (NTTPC), Lance Albertson (The Open Source Lab at Oregon State University), John Mark Walker (Red Hat), Louis Zuckerman, Joe Julian, and David Nalley.
Citrix has become a major innovator in the cloud and virtualization markets. They will drive ongoing efforts to integrate GlusterFS with CloudStack (https://forge.gluster.org/cloudstack-gluster) and the Xen hypervisor. Citrix is also sponsoring Gluster Community events, including a Gluster Cloud Night at their facility in Santa Clara, California on March 18.
The research computing group at Harvard has one of the largest known deployments of GlusterFS in the world, pushing GlusterFS beyond previously established limits. Their involvement in testing and development has been invaluable for advancing the usability and stability of GlusterFS.
Anand Avati was employee number 3 at Gluster, Inc. in 2007 and has been the most prolific contributor to the GlusterFS code base as well as its most significant architect over the years. He is primarily responsible for setting the roadmap for the GlusterFS project. Avati is employed by Red Hat but is an individual contributor for the board.
Theron became involved in the Gluster community when he started experimenting with the integration between oVirt (http://ovirt.org/) and GlusterFS. Long a proponent of converged infrastructure, Theron brings years of expertise from his stints at VMware and Nexenta.
John Mark Walker, Gluster Community Leader, Red Hat
“The additions of Citrix and Harvard FASRC to the Gluster Community show that we continue to build momentum in the software-defined storage space. With the continuing integration with all cloud and big data technologies, including the Xen Hypervisor and CloudStack, we are building the default platform for modern data workloads.”
Mark Hinkle, Senior Director, Open Source Solutions, Citrix
“We see an ever increasing hunger for storage solutions that have design points that mirror those in our open source and enterprise cloud computing efforts. Our goal is to enable many kinds of storage with varying levels of utility, and we see GlusterFS as helping to pioneer new advances in this area. As an active participant in the open source community, we want to make sure projects that we sponsor, like Apache CloudStack and the Linux Foundation’s Xen Project, are enabled to collaborate with such technologies to best serve our users.”
James Cuff, Assistant Dean for Research Computing, Harvard University
“As long term advocates of both open source, and open science initiatives at scale, Research Computing are particularly excited to participate on the Gluster Community Governing Board. We really look forward to further accelerating science and discovery through this important and vibrant community collaboration.”
***The OpenStack mark is either a registered trademark/service mark or trademark/service mark of the OpenStack Foundation, in the United States and other countries, and is used with the OpenStack Foundation’s permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
***Gluster and GlusterFS are trademarks of Red Hat, Inc.
***Xen and Linux are trademarks of The Linux Foundation
***Apache Cloudstack is a trademark of the Apache Software Foundation
Join us on March 4 for the Gluster Community seminar and learn how to improve your storage.
This half day seminar brings you in-depth presentations, use cases, demos and developer content presented by Gluster Community experts.
Register today for this free half-day seminar and reserve your seat since spaces are limited. Click here to register.
We look forward to meeting you on March 4th!
13:30 – 13:45 Registration
13:45 – 14:15 The State of the Gluster Community
14:15 – 15:30 GlusterFS for SysAdmins, Niels de Vos, Red Hat
15:30 – 15:45 Break
15:45 – 16:30 Adventures in Cloud Storage with OpenStack and GlusterFS
Tycho Klitsee (Technical Consultant and Co-owner) of Kratz Business Solutions
16:30 – 17:15 Gluster Forge Demos, Fred van Zwieten, Technical Engineer, VX Company and Marcel Hergaarden, Red Hat
17:15 – 18:00 Networking Drinks
Please follow the instructions on the GlusterFest page (gluster.org/gfest) and report your results there. Some of the test results are quite large, so you will want to report test results on a separate page, either on the Gluster.org wiki or on the paste site of your choosing, such as fpaste.org.
Please file any bugs and report them on the gluster-devel list, as well as providing links on the GlusterFest page.
In addition to performance, we have new features in 3.5 which need some further testing. Please follow the instructions on the GlusterFest page and add your results there. Some of the developers were kind enough to include testing scenarios on their feature pages. If you want your feature to be tested but didn’t supply any testing information, please add that now.
The GlusterFest begins at 00:00 GMT/UTC (today, January 17) and ends at 23:59 GMT/UTC on Monday, January 20.